We derive rates of contraction of posterior distributions on nonparametric
or semiparametric models based on Gaussian processes.
The rate of contraction is shown to depend on the position of the 
true parameter relative to the reproducing kernel Hilbert space of 
the Gaussian process and the small ball probabilities of the Gaussian 
process. We determine these quantities for a range of examples of 
Gaussian priors and in several statistical settings. For instance, we 
consider the rate of contraction of the posterior distribution based 
on sampling from a smooth density model when the prior models the 
log density as a (fractionally integrated) Brownian motion. We also 
consider regression with Gaussian errors and smooth classification 
under a logistic or probit link function combined with various priors. 

1. Introduction. Gaussian processes have been adopted as building blocks
for constructing prior distributions on infinite-dimensional statistical models
in several settings. For instance, in the setting of nonparametric density
estimation, a prior distribution on a collection of probability densities
(relative to a measure $\nu$) can be defined structurally as the random density

(1.1) $$x \mapsto \frac{e^{W_x}}{\int_{\mathcal{X}} e^{W_y}\,d\nu(y)},$$

where $(W_x : x \in \mathcal{X})$ is a Gaussian process indexed by the sample space
$\mathcal{X}$ of the observations. The Gaussian process is exponentiated to force the
prior to charge only nonnegative functions, and is next renormalized to integrate
to unity. Several other constructions have also been considered in the literature,
for density estimation as well as other statistical problems; see Section 3 and
[4, 5, 9, 12, 15, 19, 20, 21, 22, 26, 29, 34]. The book [27] makes a connection to
machine learning and the website http://www.gaussianprocess.org lists
additional references.

Given a prior and observations, Bayes' rule yields a posterior distribution 
on the parameter space. In the frequentist set-up, in which the data are 
sampled from a fixed "true" distribution and the amount of information 
in the data increases indefinitely, the corresponding posterior distributions 
often contract to the fixed true distribution, which is referred to as posterior 
consistency. In this paper, we study the rate of contraction of the posterior 
distribution relative to global metrics on the parameters. 

In most cases, the Gaussian process can be viewed as a tight Borel measurable
map in a Banach space, for instance, a space of continuous functions
or an $L_p$-space. It is well known that the support of a centered (i.e.,
zero-mean) version of such a process (the smallest closed set having probability
one under the induced measure) is equal to the closure of the reproducing
kernel Hilbert space (RKHS) of the covariance kernel of the process. Because
the posterior distribution necessarily puts all of its mass on the support of
the prior, it follows that consistency can be valid only if the parameter $w_0$
defining the true distribution of the data belongs to this support. In the
present paper, we prove that the rate of contraction in that case is expressible
in terms of the function

(1.2) $$\phi_{w_0}(\varepsilon) = \inf_{h \in \mathbb{H} : \|h - w_0\| < \varepsilon} \|h\|_{\mathbb{H}}^2 - \log \Pr(\|W\| < \varepsilon).$$

In this expression, $\|\cdot\|$ is the norm of the Banach space in which the Gaussian
process $W$ takes its values, $\mathbb{H}$ is the reproducing kernel Hilbert space of the
process and $\|\cdot\|_{\mathbb{H}}$ the RKHS-norm. If the norm $\|\cdot\|$ on the sample space
of the process "combines correctly" with the norm on the parameter space
and $n$ describes the informativeness of the data in the usual way, then the
posterior contracts at the rate $\varepsilon_n \to 0$, satisfying

(1.3) $$\phi_{w_0}(\varepsilon_n) \le n\varepsilon_n^2.$$

This is the case, for instance, in density estimation on the unit interval with
the log Gaussian process prior given in (1.1) and $\|\cdot\|$ the uniform norm
given by $\|w\|_\infty = \sup\{|w(x)| : x \in \mathcal{X}\}$. This is also the case for regression and
classification, with appropriate norms, as shown below. The rate of contraction
$\varepsilon_n$ thus depends on the position of the true parameter $w_0$ relative to
the RKHS and the amount of mass $\Pr(\|W\| < \varepsilon)$ that the prior distribution
puts in small balls around zero. In Section 4, we compute these quantities
for a range of priors.

For instance, we prove that log Gaussian densities given in (1.1), combined
with Brownian motion, yield a rate of contraction of $n^{-1/4}$ whenever the
logarithm of the true density is $\alpha$-smooth for some $\alpha \ge 1/2$ and yield the
slower rate $n^{-\alpha/2}$ for $0 < \alpha < 1/2$. That higher smoothness ($\alpha$ large) does not
improve the rate of contraction is disappointing, but perhaps not surprising,
given that the Brownian paths themselves are 1/2-smooth: the data are not
capable of smoothing out the prior roughness of the sample paths. Other,
smoother, Gaussian priors give better rates for smooth truths (depending
on their RKHS), but worse rates for rough truths. In Section 4, we exhibit,
for every possible smoothness level $\alpha$, Gaussian priors that give the optimal
rate of contraction if the true parameter possesses regularity $\alpha$.

The function $\phi_{w_0}$ displayed in (1.2) may seem complicated at first. In
fact, it can be handled for many examples. In particular, the probability
$\Pr(\|W\| < \varepsilon) = e^{-\phi_0(\varepsilon)}$ is known as the small ball probability for $\varepsilon \downarrow 0$ and has
been studied in many papers in the probability literature (see [24] or the
extensive bibliography compiled by M. Lifshits on
http://www.proba.jussieu.fr/pageperso/smalldev/biblio.pdf). In Section 4, we discuss a number of
examples. The centered small ball probability exponent $\phi_0(\varepsilon)$ puts a limit on
the rate of contraction that depends only on the prior, while the decentered
small ball probability shows how this rate might deteriorate by the
positioning of the true parameter $w_0$ relative to the support of the prior.

The paper is organized as follows. In Section 2, we recall the definition of
the RKHS of a Gaussian process and state theorems on the concentration of
Gaussian processes that are the basis of the remainder of the paper. In
Section 3, we state our main results on posterior concentration for a number of
statistical settings. Next, in Section 4, we discuss a number of special
Gaussian processes and derive the rates of posterior contraction corresponding to
true parameters of various regularity levels. Section 5 contains the proofs.

The notation $\lesssim$ is used for "smaller than or equal to a universal constant
times" and $\asymp$ means "proportional up to constants." We let $\|\cdot\|_{p,\nu}$ denote the
norm of $L_p(\nu)$, the space of measurable functions with $\nu$-integrable $p$th
absolute power. Furthermore, $h(f,g) = \|\sqrt{f} - \sqrt{g}\|_{2,\nu}$ is the Hellinger distance,
$K(f,g)$ the Kullback-Leibler divergence, and $V(f,g) = \int (\log(f/g))^2 f\,d\nu$.
If the dominating measure $\nu$ is Lebesgue measure, then it may be omitted
in the notation. The notation $C[0,1]$ is used for the space of continuous
functions $f : [0,1] \to \mathbb{R}$ endowed with the uniform norm and, for $\beta > 0$,
we let $C^\beta[0,1]$ denote the Hölder space of order $\beta$, consisting of the
functions $f \in C[0,1]$ that have $\underline{\beta}$ continuous derivatives, for $\underline{\beta}$ the biggest integer
strictly smaller than $\beta$, with the $\underline{\beta}$th derivative being Lipschitz continuous
of order $\beta - \underline{\beta}$. Finally, $H^k[0,1]$ denotes the Sobolev space of functions
$f : [0,1] \to \mathbb{R}$ that are $k - 1$ times continuously differentiable with absolutely
continuous $(k-1)$th derivative that is the integral of a function in $L_2[0,1]$,
and $\ell^\infty(\mathcal{X})$ is the space of bounded functions $z : \mathcal{X} \to \mathbb{R}$ with the uniform
norm $\|z\|_{\mathcal{X}} = \sup\{|z(x)| : x \in \mathcal{X}\}$ (also written as $\|z\|_\infty$).

2. Gaussian priors. In this section, we first recall the definition of the
RKHS and next formulate results on the support of Gaussian processes
which will be basic to the results on rates of posterior contraction in the next
section. The proofs of the results in this section are deferred to Section 5.
Relevant results on RKHS are scattered throughout the literature. Van der
Vaart and Van Zanten [30] review facts that are relevant to the present
applications.

The definition of an RKHS that is most appropriate for the results in this
paper concerns Gaussian random elements seen as Borel measurable maps
in a Banach space. A Borel measurable random element $W$ with values in a
separable Banach space $(\mathbb{B}, \|\cdot\|)$ is called Gaussian if the random variable
$b^*W$ is normally distributed for any element $b^*$ of the dual space $\mathbb{B}^*$ of $\mathbb{B}$ and
it is called zero-mean if the mean of every such variable $b^*W$ is zero. The
reproducing kernel Hilbert space (RKHS) $\mathbb{H}$ attached to $W$ is the completion
of the range $S\mathbb{B}^*$ of the map $S : \mathbb{B}^* \to \mathbb{B}$ defined by

$$Sb^* = \mathrm{E}\,W b^*(W), \qquad b^* \in \mathbb{B}^*,$$

for the inner product

$$\langle Sb_1^*, Sb_2^* \rangle_{\mathbb{H}} = \mathrm{E}\,b_1^*(W) b_2^*(W).$$

The element $Sb^* \in \mathbb{B}$ is the Pettis integral of the $\mathbb{B}$-valued random element
$Wb^*(W)$, that is, an element of $\mathbb{B}$ such that $b_2^*(Sb^*) = \mathrm{E}\,b_2^*(W) b^*(W)$ for every
$b_2^* \in \mathbb{B}^*$ (cf. [18], page 42). It can be shown that the RKHS-norm on the set $S\mathbb{B}^*$ is
stronger than the original norm, so the RKHS $\mathbb{H}$, the completion of the set
$S\mathbb{B}^*$ under the RKHS-norm, can be identified with a subset of $\mathbb{B}$.

A zero-mean Gaussian stochastic process $W = (W_t : t \in T)$ defined on
some probability space $(\Omega, \mathcal{U}, \Pr)$ and indexed by an arbitrary set $T$ with
bounded sample paths $t \mapsto W_t$ can be viewed as a map into the Banach space
$\ell^\infty(T)$. If it is Borel measurable and has separable range, then its RKHS is
defined as above. It can be shown (e.g., [30]) that this RKHS can be identified
with the completion of the set of maps

(2.1) $$t \mapsto \sum_i \alpha_i K(s_i, t) = \mathrm{E}\,W_t H, \qquad H = \sum_i \alpha_i W_{s_i},$$

under the inner product

$$\langle \mathrm{E}\,W_\cdot H_1, \mathrm{E}\,W_\cdot H_2 \rangle_{\mathbb{H}} = \mathrm{E}\,H_1 H_2.$$

Here, $K(s,t) = \mathrm{E}\,W_s W_t$ is the covariance function of the process and $H$
ranges over all finite linear combinations. This completion is precisely the
set of functions $t \mapsto \mathrm{E}\,W_t H$ with $H$ ranging over the closure of the set of
linear combinations $H = \sum_i \alpha_i W_{s_i}$ in $L_2(\Omega, \mathcal{U}, \Pr)$.
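
As a sanity check on (2.1), the following Python sketch takes Brownian motion, whose covariance function is $K(s,t) = \min(s,t)$, and compares the squared RKHS norm $\mathrm{E}\,H^2 = \sum_{i,j} \alpha_i \alpha_j K(s_i, s_j)$ of a finite linear combination with $\int_0^1 h'(t)^2\,dt$ for $h(t) = \sum_i \alpha_i K(s_i, t)$, which is the form of the norm recalled in Section 4.1. The function names, the coefficients and the grid size are illustrative assumptions and do not come from the paper.

```python
def brownian_cov(s, t):
    """Covariance function K(s, t) = E W_s W_t of standard Brownian motion."""
    return min(s, t)


def rkhs_sq_norm(alphas, points, cov=brownian_cov):
    """Squared RKHS norm E H^2 of h(t) = sum_i alpha_i K(s_i, t)."""
    return sum(a * b * cov(s, u)
               for a, s in zip(alphas, points)
               for b, u in zip(alphas, points))


def derivative_sq_norm(alphas, points, grid_size=1000):
    """Midpoint-rule value of int_0^1 h'(t)^2 dt.

    For the Brownian kernel, h'(t) = sum_i alpha_i 1{t < s_i}, a step function.
    """
    total = 0.0
    for j in range(grid_size):
        t = (j + 0.5) / grid_size
        slope = sum(a for a, s in zip(alphas, points) if t < s)
        total += slope * slope
    return total / grid_size


# Hypothetical coefficients alpha_i and index points s_i.
alphas = [1.0, -2.0, 0.5]
points = [0.25, 0.5, 0.75]

norm_from_kernel = rkhs_sq_norm(alphas, points)
norm_from_derivative = derivative_sq_norm(alphas, points)
```

Because the index points are multiples of the grid step, the midpoint rule is exact here and the two quantities agree up to rounding.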

For $\varepsilon > 0$, let $N(\varepsilon, B, d)$ denote the minimum number of balls of radius $\varepsilon$
needed to cover a subset $B$ of a metric space with metric $d$ (cf. [31]).

Theorem 2.1. Let $W$ be a Borel measurable, zero-mean Gaussian random
element in a separable Banach space $(\mathbb{B}, \|\cdot\|)$ with RKHS $(\mathbb{H}, \|\cdot\|_{\mathbb{H}})$
and let $w_0$ be contained in the closure of $\mathbb{H}$ in $\mathbb{B}$. For any numbers $\varepsilon_n > 0$
satisfying (1.3) for $\phi_{w_0}$ given by (1.2), and any $C > 1$ with $e^{-Cn\varepsilon_n^2} < 1/2$,
there exists a measurable set $B_n \subset \mathbb{B}$ such that

(2.2) $$\log N(3\varepsilon_n, B_n, \|\cdot\|) \le 6Cn\varepsilon_n^2,$$

(2.3) $$\Pr(W \notin B_n) \le e^{-Cn\varepsilon_n^2},$$

(2.4) $$\Pr(\|W - w_0\| < 2\varepsilon_n) \ge e^{-n\varepsilon_n^2}.$$

The three assertions of this theorem can be matched one-to-one with the
assumptions of general results on rates of posterior contraction (e.g.,
Theorem 2.1 of [8]), except that the assertions here use the norm of the Banach
space, whereas the conditions for the posterior rates are in terms of metrics
or discrepancies appropriate to the statistical problem under consideration.
The rate of contraction $\varepsilon_n$ is obtained as soon as the latter metrics are
comparable to the norm. This is shown to be the case for various statistical
settings in the next section.

The preceding theorem is meant to be used as an asymptotic result as
$n \to \infty$, but is, in fact, a statement for every fixed $n$. The Gaussian process
$W$ and the "true" parameter $w_0$ may therefore also be taken to be dependent
on $n$, as long as the corresponding RKHS and function $\phi_{w_0}$ are also taken
to be dependent on $n$.

In the context of sequences of Gaussian processes that approximate a 
fixed process, such as truncated Fourier series, working with a sequence of 
concentration functions would be unnecessarily cumbersome. We have the 
following refinement, which shows that we can use the concentration function 
of the limit process in such cases. 

Theorem 2.2. Let $W_n$ be Borel measurable, zero-mean, jointly Gaussian
random elements in a separable Banach space $(\mathbb{B}, \|\cdot\|)$ such that
$10\,\mathrm{E}\|W_n - W\|^2 \le 1/n$ for a Gaussian process $W$. Let $(\mathbb{H}, \|\cdot\|_{\mathbb{H}})$ be the RKHS of $W$
and assume that $w_0$ is contained in the closure of $\mathbb{H}$ in $\mathbb{B}$. For any numbers
$\varepsilon_n > 0$ satisfying $n\varepsilon_n^2 \ge 4\log 4$ and (1.3) with $\phi_{w_0}$ given by (1.2), and any
$C > 4$ with $e^{-Cn\varepsilon_n^2} < 1/2$, there exists a measurable set $B_n \subset \mathbb{B}$ such that
(2.2), (2.3) and (2.4) hold with $W$ replaced by $W_n$ and $\varepsilon_n$ replaced by $2\varepsilon_n$.

The sum $W = \sum_i W^i$ of finitely many independent Gaussian processes $W^i$
is itself a Gaussian process. It appears that it is not always easy to obtain its
RKHS from the RKHSs of the components $W^i$. However, the concentration
function of $W$ can easily be obtained from the concentration functions of
the components.

Theorem 2.3. Let $W = \sum_{i \in I} W^i$ be the sum of finitely many independent
Borel measurable, zero-mean Gaussian random elements in a separable
Banach space $(\mathbb{B}, \|\cdot\|)$ with concentration functions $\phi^i_{w^i}$ for given $w^i \in \mathbb{B}$.
Then, the concentration function $\phi_w$ of $W$ around $w = \sum_{i \in I} w^i$ satisfies

$$\phi_w(\varepsilon |I|) \le 2 \sum_{i \in I} \phi^i_{w^i}(\varepsilon/2).$$

The theorem applies, in particular, to a sum $V + W$ where $W$ possesses
the desired properties (2.2), (2.3) and (2.4) and $V$ is more concentrated at
zero than $W$, in the sense that $\Pr(\|V\| < \varepsilon) \ge \Pr(\|W\| < \varepsilon)$ for every $\varepsilon > 0$.
The theorem with $W^1 = V$, $w^1 = 0$ and $W^2 = W$ shows that $V$ will not
destroy good properties of $W$ in that case.

It is natural to scale a Gaussian process so that its fluctuations are of
the same order of magnitude as the fluctuations thought to exist in the true
parameter $w_0$. Lacking sufficient prior insight regarding $w_0$, one might use
a hierarchical prior of the form $AW$, where the scale parameter $A$ is chosen
from some distribution on $(0, \infty)$, independent of the Gaussian process $W$.
The preceding results extend to this prior if the support of the prior for $A$ is
bounded above. (The rate deteriorates if the scale parameter is not bounded
away from infinity. We do not discuss this case here.)

Theorem 2.4. Let $W$ be a Borel measurable, zero-mean Gaussian random
element in a separable Banach space $(\mathbb{B}, \|\cdot\|)$ independent of the random
variable $A$ that takes its values in an interval $(0, K] \subset (0, \infty)$. Let $w_0$ be
contained in the closure of the RKHS $\mathbb{H}$ of $W$ in $\mathbb{B}$. Let $k < 1 < K$. For any
numbers $\varepsilon_n > 0$ satisfying (1.3) for $\phi_{w_0}$ given by (1.2), and any $C > 1$ with
$e^{-Cn\varepsilon_n^2} < 1/2$, there exists a measurable set $B_n \subset \mathbb{B}$ such that

(2.5) $$\log N(3K\varepsilon_n, B_n, \|\cdot\|) \le 6Cn\varepsilon_n^2,$$

(2.6) $$\Pr(AW \notin B_n) \le e^{-Cn\varepsilon_n^2},$$

(2.7) $$\Pr(\|AW - w_0\| < 2K\varepsilon_n) \ge \Pr(A > k)\,e^{-n\varepsilon_n^2/k^2}.$$

3. Main results on posterior contraction. Gaussian processes can be
used as building blocks for constructing priors on function spaces in various
ways and in several statistical settings. In order for our general approach
to apply, appropriate metrics on the set of distributions of the observations
must correspond to the norm of the Banach space in which the Gaussian
process takes its values. In this section, we describe several cases where this
desirable situation is achieved. These are motivated by implementations in
the literature and do not form an exhaustive set.

3.1. Density estimation. Suppose that we observe an i.i.d. sample $X_1, \ldots,
X_n$ from a density $p_0$ relative to a measure $\nu$ on a measurable space $(\mathcal{X}, \mathcal{A})$.
Consider a prior distribution on the set of $\nu$-densities defined structurally
as $p_W$ for a Gaussian process $W$ and, for $p_w$, the function defined by

$$p_w(x) = \frac{e^{w_x}}{\int_{\mathcal{X}} e^{w_y}\,d\nu(y)}.$$

(The notation $p_0$ now denotes both the true density and the density $p_w$
with $w = 0$.) Implementations of this prior were considered in [19, 20, 22]
addressing, for instance, the computation of the posterior mean.
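
To make the construction concrete, here is a minimal Python sketch of one draw from this prior on a grid over $[0,1]$, taking $W$ to be a standard Brownian motion. The grid, the seed, and the use of the uniform probability measure on the grid as a stand-in for Lebesgue measure are assumptions made for illustration; the cited implementations may proceed quite differently.

```python
import math
import random


def brownian_path(n_steps, rng):
    """Values W_{j/n}, j = 0, ..., n, of a standard Brownian motion on [0, 1]."""
    path = [0.0]
    step_sd = math.sqrt(1.0 / n_steps)
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, step_sd))
    return path


def log_gaussian_density(w):
    """Grid version of p_w = e^w / int e^w dnu.

    Here nu is the uniform probability measure on the grid points, so the
    integral is an average over the grid.
    """
    shift = max(w)  # subtracting the maximum avoids overflow in exp
    unnormalized = [math.exp(value - shift) for value in w]
    integral = sum(unnormalized) / len(unnormalized)
    return [u / integral for u in unnormalized]


rng = random.Random(0)      # fixed seed, for reproducibility only
w = brownian_path(1000, rng)
p = log_gaussian_density(w)
```

Exponentiation makes every value of `p` positive, and the division makes `p` integrate to one with respect to the grid measure, mirroring the two steps described for (1.1).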

Assume that $W$ has bounded sample paths and can be viewed as a
Borel measurable map in the space $\ell^\infty(\mathcal{X})$ of bounded functions $z : \mathcal{X} \to \mathbb{R}$
equipped with the uniform norm. The following theorem shows that the rate
of contraction for log Gaussian prior densities is determined exactly as in
(1.2)-(1.3), with $w_0 = \log p_0$.

Theorem 3.1. Let $W$ be a Borel measurable, zero-mean, tight Gaussian
random element in $\ell^\infty(\mathcal{X})$. Suppose that $w_0 = \log p_0$ is contained in
the support of $W$ and let $\phi_{w_0}$ be the function in (1.2) with $\|\cdot\|$ the uniform
norm on $\ell^\infty(\mathcal{X})$. Then, the posterior distribution relative to the prior
$p_W$ satisfies $\mathrm{E}_0 \Pi_n(p_w : h(p_w, p_0) > M\varepsilon_n \mid X_1, \ldots, X_n) \to 0$ for any sufficiently
large constant $M$ and $\varepsilon_n$ given by (1.3).

Proof. The proof of the theorem is based on Theorem 2.1 of [8] and
a comparison of the Hellinger and Kullback-Leibler distances between log
Gaussian prior densities to the uniform distance on the Gaussian process,
as in the lemma below.

We choose the set $\mathcal{P}_n$ of [8] equal to $\{p_w : w \in B_n\}$, where $B_n \subset \ell^\infty(\mathcal{X})$ is
the measurable set as in Theorem 2.1, with $C$ a large constant. In view of
the first inequality of Lemma 3.1, for sufficiently large $n$, the $4\varepsilon_n$-entropy of
$\mathcal{P}_n$ relative to the Hellinger distance is bounded above by the $3\varepsilon_n$-entropy of
the set $B_n$ relative to the uniform distance, which is bounded by $6Cn\varepsilon_n^2$, by
Theorem 2.1. This verifies (2.2) of [8]. The prior probability $\Pi(\mathcal{P}_n^c)$ outside
the set $\mathcal{P}_n$, as in (2.3) of [8], is bounded by the probability of the event
$\{W \notin B_n\}$, which is bounded by $e^{-Cn\varepsilon_n^2}$, by Theorem 2.1. Finally, by the
second and third inequalities of Lemma 3.1, the prior probability as in (2.4)
of [8], but with $\varepsilon_n$ replaced by a multiple of $\varepsilon_n$, is bounded below by the
probability of the event $\{\|W - w_0\|_\infty < 2\varepsilon_n\}$, which is bounded below by
$e^{-n\varepsilon_n^2}$, by Theorem 2.1. □

Lemma 3.1. For any measurable functions $v, w : \mathcal{X} \to \mathbb{R}$, we have the
following:

$$h(p_v, p_w) \le \|v - w\|_\infty\, e^{\|v - w\|_\infty/2};$$

$$K(p_v, p_w) \lesssim \|v - w\|_\infty^2\, e^{\|v - w\|_\infty} (1 + \|v - w\|_\infty);$$

$$V(p_v, p_w) \lesssim \|v - w\|_\infty^2\, e^{\|v - w\|_\infty} (1 + \|v - w\|_\infty)^2.$$

Proof. The triangle inequality and simple algebra give

$$h(p_v, p_w) = \left\| \frac{e^{v/2}}{\|e^{v/2}\|_2} - \frac{e^{w/2}}{\|e^{w/2}\|_2} \right\|_2 \le 2\, \frac{\|e^{v/2} - e^{w/2}\|_2}{\|e^{w/2}\|_2}.$$

Because $|e^{v/2} - e^{w/2}| = e^{w/2} |e^{v/2 - w/2} - 1| \le e^{w/2} e^{|v - w|/2} |v - w|/2$ for any
$v, w \in \mathbb{R}$, the square of the right-hand side is bounded by

$$\frac{\int e^w e^{|v - w|} |v - w|^2\,d\nu}{\int e^w\,d\nu} \le e^{\|v - w\|_\infty} \|v - w\|_\infty^2.$$

This proves the first assertion of the lemma. We derive the other assertions
from the first using the equivalence of $K$, $V$ and the Hellinger distance if
the quotient of the densities is uniformly bounded. Because
$w - \|v - w\|_\infty \le v \le w + \|v - w\|_\infty$, we have

$$\int e^w\,d\nu\; e^{-\|v - w\|_\infty} \le \int e^v\,d\nu \le \int e^w\,d\nu\; e^{\|v - w\|_\infty}.$$

Taking logarithms, we see that
$-\|v - w\|_\infty \le \log\bigl(\int e^v\,d\nu / \int e^w\,d\nu\bigr) \le \|v - w\|_\infty$. Therefore,

$$\left\| \log \frac{p_v}{p_w} \right\|_\infty = \left\| v - w - \log \frac{\int e^v\,d\nu}{\int e^w\,d\nu} \right\|_\infty \le 2\|v - w\|_\infty.$$

The second and third inequalities now follow from the first by Lemma 8 of
[10]. □
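
The first inequality of Lemma 3.1 and the bound on $\log(p_v/p_w)$ from the proof can be checked numerically. The Python sketch below does so on a grid, with $\nu$ taken to be the uniform probability measure on the grid points (the lemma holds for any dominating measure, so this is a legitimate special case). The particular functions `w` and `v` are arbitrary choices made for illustration.

```python
import math


def density(w):
    """p_w = e^w / int e^w dnu, with nu uniform on the grid points."""
    unnormalized = [math.exp(value) for value in w]
    integral = sum(unnormalized) / len(unnormalized)
    return [u / integral for u in unnormalized]


def hellinger(p, q):
    """Hellinger distance ||sqrt(p) - sqrt(q)||_2 with respect to nu."""
    squares = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squares / len(p))


def sup_dist(v, w):
    """Uniform distance between two functions on the grid."""
    return max(abs(a - b) for a, b in zip(v, w))


grid = [j / 200 for j in range(201)]
w = [math.sin(2 * math.pi * x) for x in grid]               # arbitrary log density
v = [wx + 0.1 * math.cos(5 * x) for wx, x in zip(w, grid)]  # perturbed version

d = sup_dist(v, w)
p_v, p_w = density(v), density(w)

lhs = hellinger(p_v, p_w)          # h(p_v, p_w)
rhs = d * math.exp(d / 2)          # ||v - w|| e^{||v - w|| / 2}
log_ratio = max(abs(math.log(a / b)) for a, b in zip(p_v, p_w))
```

For these inputs `lhs` does not exceed `rhs`, and `log_ratio` stays below `2 * d`, as the lemma and its proof predict.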

3.2. Classification. Suppose that we observe a random sample of vectors
$(X_1, Y_1), \ldots, (X_n, Y_n)$ from the distribution of $(X, Y)$, where $Y$ takes
its values in the set $\{0, 1\}$ and $X$ takes its values in some measurable space
$(\mathcal{X}, \mathcal{A})$. Consider estimating the binary regression function
$f_0(x) = \Pr(Y = 1 \mid X = x)$. Given a fixed, measurable function $\Psi : \mathbb{R} \to (0, 1)$, we may
construct a prior on the set of regression functions as $f_W$ for a Gaussian process
$W = (W_x : x \in \mathcal{X})$ and $f_w$ the function

$$f_w(x) = \Psi(w_x).$$

Here, $w_x$ denotes the value at $x$ of a function $w : \mathcal{X} \to \mathbb{R}$. The likelihood for
$(X, Y)$ factorizes as

$$p_w(x, y) = f_w(x)^y (1 - f_w(x))^{1 - y} g(x),$$

that is, into the conditional likelihood of $Y$ given $X$ and the marginal
likelihood $g$ for $X$. As this causes the marginal density $g$ to cancel from the
posterior distribution for $f_w$, it is not necessary to put a prior on $g$. We
can set the distribution of $X$ equal to the "true" distribution $G$ in all of
the following and incorporate it into the dominating measure $\nu$ so that the
factor $g$ can be omitted from the likelihood. We assume that $f_0$ is never
zero and then, with some abuse of notation, can define $w_0$ by the equation
$f_0 = \Psi(w_0)$.

The link function $\Psi$ is assumed to be a known differentiable function with
bounded derivative $\psi$. Link functions that lead to agreement between the
metrics on the set of densities $p_w$ and the $L_2$-metric on the set of functions
$w$ are especially attractive in the present context. The logistic link function
qualifies in this respect. More generally, there is perfect agreement whenever
the function $\psi/(\Psi(1 - \Psi))$ is uniformly bounded.

Implementations of this model are described in [19, 20, 27, 34]. The first
three follow [1] in defining latent variables and setting up an MCMC scheme.
For probit regression, the latent variable is a single Gaussian variable $Z_i$
which models $Y_i = 1_{Z_i > 0}$ and, given $X_i = x$, possesses mean $w_x$ and variance
1. Logistic or other link functions are approximated by scale mixtures of
Gaussian links, with an additional latent scale variable. [27] proposes to
compute an approximation to the posterior distribution, either of Laplace
form or by an algorithm termed "expectation propagation," both applicable
to general priors.

Theorem 3.2. (i) Suppose that the function $\psi/(\Psi(1 - \Psi))$ is bounded.
Let $W$ be a Borel measurable, zero-mean, tight Gaussian random element
in $L_2(G)$. Suppose that $w_0 = \Psi^{-1}(f_0)$ is contained in the support of $W$ and
let $\phi_{w_0}$ be the function in (1.2) with $\|\cdot\|$ the $L_2(G)$-norm. Then, the posterior
distribution relative to the prior $p_W$ satisfies
$\mathrm{E}_0 \Pi_n(w : \|p_w - p_0\|_{G,2} > M\varepsilon_n \mid X_1, Y_1, \ldots, X_n, Y_n) \to 0$
for any sufficiently large constant $M$ and $\varepsilon_n$ given by (1.3).

(ii) Suppose that the function $w_0 = \Psi^{-1}(f_0)$ is bounded. Let $W$ be a Borel
measurable, zero-mean, tight Gaussian random element in $\ell^\infty(\mathcal{X})$. Suppose
that $w_0$ is contained in the support of $W$ and let $\phi_{w_0}$ be the function in (1.2)
with $\|\cdot\|$ the uniform norm. The same conclusion is then true.

Proof. This follows from combining Theorem 2.1 of [8] and
Theorem 2.1 of the present paper, in the same fashion as Theorem 3.1 was proved
by combining these two results. The details are as follows.

(i) Because the densities $p_w$ are uniformly bounded, the $L_2$-norm on the
set of densities is bounded above by a multiple of the Hellinger distance.
Thus, we can apply Theorem 2.1 of [8] with $d$ equal to the $L_2(G)$-norm. The
square $L_2(G)$-norm and the Kullback-Leibler quantities $K$ and $V$ on the
densities $p_w$ are all bounded above by multiples of the square $L_2(G)$-norm
on the functions $w$, by Lemma 3.2 below. Therefore, Theorem 2.1, with $\|\cdot\|$
the $L_2(G)$-norm, allows us to bound the quantities in Theorem 2.1 of [8].

(ii) If the function $w_0 = \Psi^{-1}(f_0)$ is bounded, then so are the functions in
a uniform neighborhood of it and so is the function $\psi/(\Psi(1 - \Psi))$ on the
relevant domain. The proof can next be completed as before. □

The theorem can be extended to link functions with an unbounded function
$\psi/(\Psi(1 - \Psi))$, even if the function $w_0 = \Psi^{-1}(f_0)$ is unbounded, by using
appropriate norms on the Gaussian process. For instance, the probit link
function can be treated as soon as the function $w_0 = \Psi^{-1}(f_0)$ is contained
in $L_4(G)$, with a combination of the $L_2((w_0^2 \vee 1) \cdot G)$ and $L_4(G)$-norms on
the Gaussian process. This can be proven in the same way as the preceding
theorem, using Lemma 3.2 below.

For general link functions, the relationship between the appropriate norms
on the densities and the norm on the Gaussian process is moderated by the
function $S : \mathbb{R}^2 \to \mathbb{R}$ given by

$$S(w, w_0) = \sup_{v : v \in [w, w_0] \cup [w_0, w]} \frac{\psi}{\Psi(1 - \Psi)}(v).$$

Lemma 3.2. If $\Psi$ possesses a bounded derivative $\psi$, then, for any
measurable functions $v, w : \mathcal{X} \to \mathbb{R}$ and any $r \ge 1$, we have the following:

• $\|p_v - p_w\|_r = 2^{1/r} \|\Psi(v) - \Psi(w)\|_{r,G} \lesssim \|v - w\|_{r,G}$;

• $K(p_w, p_{w_0}) \lesssim \|(w - w_0)\sqrt{S(w, w_0)}\|_{2,G}^2$;

• $V(p_w, p_{w_0}) \lesssim \|(w - w_0) S(w, w_0)\|_{2,G}^2$.

For $\Psi$ the distribution function of the logistic distribution, the function $S$
is uniformly bounded. For $\Psi$ the distribution function of the normal distribution,
$K(p_w, p_{w_0})$ and $V(p_w, p_{w_0})$ are bounded above by a multiple of
$\|w - w_0\|_{2,G_0}^2 + \|w - w_0\|_{4,G}^4$, where $G_0$ is the measure defined by
$dG_0 = (w_0^2 \vee 1)\,dG$.

Proof. The first assertion follows immediately from the fact that
$|p_v(x, 0) - p_w(x, 0)| = |p_v(x, 1) - p_w(x, 1)| = |\Psi(v_x) - \Psi(w_x)|$ for any $x$. For the second
inequality, we consider, for fixed $w_0 \in \mathbb{R}$, the function $g_{w_0} : \mathbb{R} \to \mathbb{R}$ given by

$$g_{w_0}(w) = \Psi(w_0) \log \frac{\Psi(w_0)}{\Psi(w)} + (1 - \Psi(w_0)) \log \frac{1 - \Psi(w_0)}{1 - \Psi(w)}.$$

The derivative of this function is $g'_{w_0}(w) = (\psi/(\Psi(1 - \Psi)))(w)\,(\Psi(w) - \Psi(w_0))$.
In view of the definition of $S$ and Taylor's theorem, it follows that
$|g_{w_0}(w)| \lesssim S(w, w_0)(w - w_0)^2$. The second assertion is then clear from the fact that
$K(p_w, p_{w_0}) = \int g_{w_0}(w)\,dG$.

For the third inequality, we note that, by Taylor's theorem,

$$\left| \log \frac{\Psi(w)}{\Psi(w_0)} \right| \vee \left| \log \frac{1 - \Psi(w)}{1 - \Psi(w_0)} \right| \le S(w, w_0)\,|w - w_0|.$$

Since $V(p_w, p_{w_0})$ is a weighted integral of the squares of the quantities on
the left-hand side, the third inequality follows.

For $\Psi$ the logistic distribution function, the function $\psi/(\Psi(1 - \Psi))$ is easily
seen to be bounded. The normal distribution function $\Psi$ satisfies
$(\psi/(\Psi(1 - \Psi)))(x) \sim |x|$ as $x \to \pm\infty$, and this quotient is hence bounded by a
multiple of $|x| \vee 1$. It follows that $|S(w, w_0)| \lesssim (|w_0| + |w_0 - w|) \vee 1$.
Substituting this into the bounds on $K(p_w, p_{w_0})$ and $V(p_w, p_{w_0})$ readily
yields the last assertion of the lemma. □
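
The contrast between the two link functions can be seen numerically. The following Python sketch evaluates $\psi/(\Psi(1 - \Psi))$ for the logistic and the probit link; the function names are illustrative, and the complementary error function from the standard library is used for the normal tails to limit cancellation error.

```python
import math


def logistic_ratio(x):
    """psi / (Psi (1 - Psi)) for the logistic link Psi(x) = 1 / (1 + e^{-x})."""
    big_psi = 1.0 / (1.0 + math.exp(-x))
    small_psi = math.exp(-x) / (1.0 + math.exp(-x)) ** 2  # derivative of Psi
    return small_psi / (big_psi * (1.0 - big_psi))


def probit_ratio(x):
    """psi / (Psi (1 - Psi)) for the probit link (standard normal cdf)."""
    small_psi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    upper_tail = 0.5 * math.erfc(x / math.sqrt(2.0))    # 1 - Psi(x)
    lower_tail = 0.5 * math.erfc(-x / math.sqrt(2.0))   # Psi(x)
    return small_psi / (lower_tail * upper_tail)


# Moderate arguments only: for very large |x| the logistic tail underflows.
xs = [-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0]
logistic_values = [logistic_ratio(x) for x in xs]
probit_values = [probit_ratio(x) for x in xs]
```

For the logistic link the quotient is identically one, because $\psi = \Psi(1 - \Psi)$; for the probit link it grows roughly linearly in $|x|$, which is why the weighted norms of Lemma 3.2 are needed there.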

3.3. Regression with fixed covariates. Suppose that we observe independent
variables $Y_1, \ldots, Y_n$ following the regression model $Y_i = w_0(x_i) + e_i$
for unobservable $N(0, \sigma_0^2)$-distributed errors $e_i$ and fixed, known elements
$x_1, \ldots, x_n$ of a set $\mathcal{X}$. Consider estimating the regression function $w_0$.

As a prior on $w$, we use a Gaussian process $(W_x : x \in \mathcal{X})$. As this is
conjugate, implementation is straightforward (see, e.g., [27], Chapter 2). If the
standard deviation $\sigma$ of $e_i$ is not known, then we may also put a prior on $\sigma$,
which we assume to be supported on a given interval $[a, b] \subset (0, \infty)$ with a
Lebesgue density that is bounded away from zero. Unfortunately, the popular
inverse Gamma prior does not satisfy the latter condition.
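
The conjugate computation can be sketched in a few lines of Python. The example assumes a known error standard deviation and, purely for illustration, the covariance $K(s,t) = 1 + \min(s,t)$ of a Brownian motion started at an independent standard normal value; the posterior mean at a point $x$ is then $k(x)^T (K + \sigma^2 I)^{-1} y$, the standard Gaussian conditioning formula. The solver, the data and all names are hypothetical and are not taken from [27].

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def kernel(s, t):
    """Covariance of Brownian motion started at an independent N(0, 1) value."""
    return 1.0 + min(s, t)


def posterior_mean(xs, ys, sigma, x_new):
    """E[W_{x_new} | Y_1, ..., Y_n] for known error standard deviation sigma."""
    gram = [[kernel(s, t) + (sigma ** 2 if i == j else 0.0)
             for j, t in enumerate(xs)]
            for i, s in enumerate(xs)]
    weights = solve(gram, ys)
    return sum(kernel(x_new, s) * wt for s, wt in zip(xs, weights))


# Hypothetical design points and responses.
xs = [0.1, 0.4, 0.7, 0.9]
ys = [0.5, 1.0, 0.2, -0.3]
fitted = [posterior_mean(xs, ys, 1e-4, x) for x in xs]
```

With a very small error variance the posterior mean nearly interpolates the observations, while a very large variance shrinks it toward the zero prior mean.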

The natural semimetric for this problem is the $L_2(\mathbb{P}_n^x)$-norm for the
empirical measure $\mathbb{P}_n^x = n^{-1} \sum_{i=1}^n \delta_{x_i}$ of the design variables. For fixed $n$, the
Gaussian stochastic process $(W_x : x \in \mathcal{X})$ is important at the design points
only and must be viewed as a Borel measurable map in the Banach space
$L_2(\mathbb{P}_n^x)$. As this varies with $n$, it is more convenient to view it as a map in
the space $\ell^\infty(\mathcal{X})$ of bounded functions on $\mathcal{X}$, whose norm is stronger than
any of the $L_2(\mathbb{P}_n^x)$-norms.

Theorem 3.3. Let $W$ be a zero-mean, tight Gaussian random element
in $\ell^\infty(\mathcal{X})$ and suppose that $w_0$ is contained in the support of $W$.
Furthermore, let $S$ be a random variable with values in an interval $[a, b] \subset (0, \infty)$
that contains $\sigma_0$. Let $\phi_{w_0}$ be the function in (1.2) with $\|\cdot\|$ the supremum
norm on $\ell^\infty(\mathcal{X})$. Then, the posterior distribution satisfies
$\mathrm{E}_0 \Pi_n((w, \sigma) : \|w - w_0\|_n + |\sigma - \sigma_0| > M\varepsilon_n \mid Y_1, \ldots, Y_n) \to 0$
for any sufficiently large constant $M$ and $\varepsilon_n$ given by (1.3).

Proof. Let $\|\cdot\|_n$ be the $L_2(\mathbb{P}_n^x)$-norm. For the case where the prior on
$\sigma$ is degenerate at the true value, it is shown in [11] that the rate of posterior
contraction is at least $\varepsilon_n$ whenever there exist sets $\mathcal{W}_n$ satisfying

$$\log N(\varepsilon_n, \mathcal{W}_n, \|\cdot\|_n) \le n\varepsilon_n^2,$$

$$\Pi_n(\mathcal{W}_n^c) \le e^{-Cn\varepsilon_n^2},$$

$$\Pi_n(w : \|w - w_0\|_n \le \varepsilon_n) \ge e^{-n\varepsilon_n^2}.$$

This result is based on comparisons of the Kullback-Leibler divergence and
variance to the square of the norm $\|\cdot\|_n$, and the construction of tests. It
can be extended to the present case of an unknown scale that is bounded
away from zero and infinity. The theorem then follows from Theorem 2.1. □

3.4. White noise model. Suppose that we observe a sample path of the
stochastic process $X^{(n)} = (X_t^{(n)} : 0 \le t \le 1)$, defined structurally as, for a
given function $w_0 \in L_2[0, 1]$,

$$X_t^{(n)} = \int_0^t w_0(s)\,ds + \frac{1}{\sqrt{n}} B_t,$$

for a standard Brownian motion $B$. Consider estimating the function $w_0$.
More formally, the statistical experiment consists of the set of induced
distributions of the process $X^{(n)}$ on the Borel $\sigma$-field of the space $C[0, 1]$ of
continuous functions equipped with the uniform norm, as the parameter $w$
varies over a given subset of $L_2[0, 1]$.

Consider the prior on the model obtained by modeling the parameter $w$ as
the sample path of a Gaussian process $W$ with values in the space $L_2[0, 1]$.
As this is conjugate, the practical implementation is straightforward.

It is immediate from combining the preceding proposition with
Theorem 2.1 that the rate of posterior contraction is determined by equations
(1.2)-(1.3).

Theorem 3.4. Let $W$ be a zero-mean, tight Gaussian random element
in $L_2[0, 1]$ and suppose that $w_0$ is contained in the support of $W$. Let $\phi_{w_0}$
be the function in (1.2) with $\|\cdot\|$ the $L_2[0, 1]$-norm. Then, the posterior
distribution satisfies $\mathrm{E}_0 \Pi_n(w : \|w - w_0\|_2 > M\varepsilon_n \mid X^{(n)}) \to 0$ for any sufficiently
large constant $M$ and $\varepsilon_n$ given by (1.3).

4. Examples of Gaussian priors. In this section, we give a number of
examples of Gaussian process priors and compute their concentration
functions (1.2) for "true parameters" of interest. We are especially interested
in exhibiting processes that give the "correct" rates for true parameters of
varying smoothness.

4.1. Brownian motion and its primitives. For modeling functions on the
one-dimensional unit interval, Brownian motion is a good starting point. It
can be viewed as a map into the space $C[0, 1]$, but also as a map in $L_r[0, 1]$.
This does not affect its RKHS and small ball probabilities, which are both
well known. The RKHS of Brownian motion is the collection of absolutely
continuous functions $w : [0, 1] \to \mathbb{R}$ with $w(0) = 0$ and $\int w'(t)^2\,dt < \infty$ with
RKHS-norm $\|w\|_{\mathbb{H}} = \|w'\|_2$. The small ball probabilities of Brownian motion
satisfy (cf. [24]), as $\varepsilon \downarrow 0$, for any $r \in [1, \infty]$,

$$-\log \Pr(\|W\|_r < \varepsilon) \asymp (1/\varepsilon)^2.$$

The support of Brownian motion in $C[0, 1]$ is the set of all functions with
$w(0) = 0$. Interestingly, the support as a map in $L_r[0, 1]$, for $r < \infty$, is the
full space $L_r[0, 1]$.

The sample paths of Brownian motion are tied down to 0 at 0 and this, of
course, remains the case for the functions in its RKHS. This can be relaxed
by starting the process at an independent standard normal variable. The
RKHS of "Brownian motion started at random" is the space of functions
$w : [0, 1] \to \mathbb{R}$ with $\int w'(t)^2\,dt < \infty$ and with square RKHS-norm
$\|w\|_{\mathbb{H}}^2 = w(0)^2 + \|w'\|_2^2$.

The small ball probability leads, by way of (1.3), to the restriction
$\varepsilon_n^{-2} \lesssim n\varepsilon_n^2$, equivalently $\varepsilon_n \gtrsim n^{-1/4}$, on the rate of contraction.

The concentration function (1.2) further depends on the position of the 
true parameter wq relative to the RKHS. We may compute this contribution 
by approximation of wq through a kernel smoother. For <fi a (x) = a" 1 (b(x/a) 
a smooth kernel, the convolution wo * (b a is contained in the RKHS and has 
uniform distance of the order a 13 (as a — ► 0) to a function wq G C^[0, 1] that 
is Lipschitz of the order (3 G (0, 1] and has square RKHS-norm wq * 4> a (0) 2 + 
||(u> * (/v/Hi °f t ne order a~^ 2 ~ 2 ^> (see below). The choice a >c e 1 ^ readily 
yields that 

inf{||/ l '||M|/ l - U ; || oo <,}< e -( 2 - 2 ^. 

The concentration function (1.2) is the sum of this and the small ball
exponent $(1/\varepsilon)^2$. For $\beta\ge1/2$, the contribution of the small ball probability
to the concentration function (1.2) dominates and the rate of contraction
$\varepsilon_n$ is $n^{-1/4}$. For $\beta\in(0,1/2)$, the contribution as in the preceding display
dominates and will yield a rate not faster than $n^{-\beta/2}$. In particular, higher
smoothness of the true parameter $w_0$ does not lead to a higher rate of
contraction than $n^{-1/4}$ for $\beta\ge1/2$.
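The trade-off between the two contributions can be checked numerically. The following Python sketch is an illustration only: it assumes that both bounds hold with all constants equal to 1, solves $\phi(\varepsilon)=n\varepsilon^2$ by bisection and compares the solution with $n^{-\min(1/4,\beta/2)}$. The function names are hypothetical and the sketch is meant for moderate values of $\beta$ (very small $\beta$ overflows floating point).

```python
import math

def concentration_bound(eps, beta):
    """Approximation term plus small ball exponent, all constants set to 1 (assumption)."""
    return eps ** (-(2.0 - 2.0 * beta) / beta) + eps ** (-2.0)

def solve_rate(n, beta, lo=1e-12, hi=1.0, steps=200):
    """Smallest eps in [lo, hi] with concentration_bound(eps, beta) <= n * eps**2.

    The left side decreases and the right side increases in eps, so
    bisection on a logarithmic scale finds the crossing point.
    """
    for _ in range(steps):
        mid = math.sqrt(lo * hi)
        if concentration_bound(mid, beta) <= n * mid * mid:
            hi = mid
        else:
            lo = mid
    return hi

def predicted_rate(n, beta):
    """Rate n^(-min(1/4, beta/2)) read off from the discussion in the text."""
    return n ** (-min(0.25, beta / 2.0))

if __name__ == "__main__":
    for beta in (0.25, 0.5, 1.0):
        print(beta, solve_rate(1e12, beta), predicted_rate(1e12, beta))
```

Under these assumptions the numerical solution and the predicted rate agree up to a constant factor, and increasing $\beta$ beyond 1/2 does not improve the rate.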

For this reason, or by intuition, Brownian motion may be considered to
be too rough as a prior. Integrating the sample paths one or more times will
remedy this and will give Gaussian priors of smoothness $3/2, 5/2, \ldots$. To
fill the gaps between these numbers, we consider, more generally, fractional
integrals and fractional Brownian motion in the next sections. For ordinary
integrals, the result is simpler and as follows.

Define $I_{0+}^1f$ as the function $t\mapsto\int_0^tf(s)\,ds$ and $I_{0+}^kf$ as $I_{0+}^1(I_{0+}^{k-1}f)$.

Theorem 4.1. Let $W$ be a standard Brownian motion and $Z_0,\ldots,Z_k$
independent standard normal random variables. The RKHS of the process
$t\mapsto I_{0+}^kW_t+\sum_{i=0}^kZ_it^i/i!$ is the Sobolev space $H^{k+1}[0,1]$ with square norm
$\|h\|_{\mathbb{H}}^2=\|h^{(k+1)}\|_2^2+\sum_{i=0}^kh^{(i)}(0)^2$. The concentration function of this process
viewed as a map in $C[0,1]$ for an element $w\in C^\beta[0,1]$ for $\beta\le k+1/2$
satisfies $\phi_w(\varepsilon)=O(\varepsilon^{-(2k-2\beta+2)/\beta})$ as $\varepsilon\downarrow0$.

Proof. That the RKHS takes the present form is well known. See, for 
instance, [30] for a self-contained proof. 

The concentration function of $I_{0+}^kW$ around zero satisfies $\phi_0(\varepsilon)\asymp\varepsilon^{-1/(k+1/2)}$
as $\varepsilon\to0$ by Theorem 2.1 of [23]. The concentration function of the process
$t\mapsto\sum_{i=0}^kZ_it^i$ is of the order $\log(1/\varepsilon)$ and hence is much smaller. Thus, the
concentration function of $I_{0+}^kW+\sum_{i=0}^kZ_it^i/i!$ around zero is of the order

\[ \varepsilon^{-1/(k+1/2)}. \]

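To compute the concentration function around a function $w\in C^\beta[0,1]$, we
utilize convolutions $w*\phi_\sigma$ with a smooth higher-order kernel $\phi_\sigma$ with scale
$\sigma$. As in kernel density estimation, the uniform distance between $w*\phi_\sigma$
and $w$ is of the order $\sigma^\beta$. The functions $w*\phi_\sigma$ belong to the RKHS. By
writing $(w*\phi_\sigma)^{(l)}=w^{(\underline{\beta})}*\phi_\sigma^{(l-\underline{\beta})}$ for $\underline{\beta}$ the largest integer smaller than $\beta$,
we see that $\|(w*\phi_\sigma)^{(l)}\|_\infty$ is bounded above by a multiple of $\sigma^{-(l-\beta)}$ if
$w\in C^\beta[0,1]$ and $l\ge\beta$ and hence that the square RKHS-norm of $w*\phi_\sigma$ is of
the order $\sigma^{-(2k-2\beta+2)}$ if $w\in C^\beta[0,1]$. Setting $\sigma\asymp\varepsilon^{1/\beta}$, we see that

\[ \inf\{\|h\|_{\mathbb{H}}^2:\|h-w\|_\infty<\varepsilon\}\lesssim\varepsilon^{-(2k-2\beta+2)/\beta}. \]

Thus, the approximation part of $\phi_w$ is of the order $\varepsilon^{-(2k-2\beta+2)/\beta}$. For
$\beta\le k+1/2$, this dominates the part $\varepsilon^{-1/(k+1/2)}$ resulting from the centered
small ball probability. □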

For $\beta=k+1/2$, the concentration function in the preceding theorem
becomes $\phi_w(\varepsilon)=\varepsilon^{-1/\beta}$. For this function, inequality (1.3) is solved by

\[ \varepsilon_n\asymp n^{-\beta/(2\beta+1)}. \]

This is the minimax rate for estimating a function that is known to be
$\beta$-regular in various nonparametric models. Combination of the preceding
theorem with the results on posterior contraction shows that the Gaussian
prior in this case yields the optimal rate of convergence. For $\beta\ne k+1/2$,
the Gaussian prior gives consistency with a rate, but the minimax rate is
not achieved. This corresponds to an under- or over-smoothed prior.
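The rate implied by the bound of Theorem 4.1 for general $\beta\le k+1/2$ can be worked out in the same way. The following is a sketch that treats the $O$-bound of the theorem as if it were sharp and ignores constants; only the case $\beta=k+1/2$ is stated explicitly in the text.

```latex
% Solve \phi_w(\varepsilon_n) \le n\varepsilon_n^2 with
% \phi_w(\varepsilon) \lesssim \varepsilon^{-(2k-2\beta+2)/\beta}, \beta \le k + 1/2.
\begin{align*}
\varepsilon_n^{-(2k-2\beta+2)/\beta} \le n\varepsilon_n^{2}
  &\iff \varepsilon_n^{-(2k+2)/\beta} \le n
  \iff \varepsilon_n \ge n^{-\beta/(2k+2)}, \\
\beta = k + \tfrac12
  &\implies \frac{\beta}{2k+2} = \frac{\beta}{2\beta+1}, \\
\beta < k + \tfrac12
  &\implies \frac{\beta}{2k+2} < \frac{\beta}{2\beta+1}.
\end{align*}
```

Under this reading, the exponent $\beta/(2k+2)$ matches the minimax exponent exactly when $\beta=k+1/2$ and falls short of it for rougher truths.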

Kimeldorf and Wahba [15] and Wahba [33] have considered priors of the
type $t\mapsto\sqrt{b}\,I_{0+}^kW_t+\sqrt{a}\sum_{i=0}^kZ_it^i$ in the setting of the regression model
$Y_i=w(x_i)+e_i$. These priors are the same as in the preceding theorem, but
with additional scaling factors $\sqrt{b}$ and $\sqrt{a}$. They show that if $a\to\infty$ and $b$
and $n$ are fixed, then the posterior mean of the regression function tends to
the minimizer $\hat{w}_n$ of the penalized least squares criterion

\[ w\mapsto\frac1n\sum_{i=1}^n(w(x_i)-y_i)^2+\frac{\sigma^2}{nb}\int_0^1w^{(k)}(t)^2\,dt, \]

where $\sigma^2$ is the variance of the regression errors. Letting $a$ tend to infinity
has the purpose of making the prior on the finite-dimensional, polynomial
part diffuse, while the infinite-dimensional part of the prior is fixed. The
preceding theorem considers the rate of contraction of the full prior as
$n\to\infty$, for fixed $a$ and $b$, and hence is not directly comparable to the results
of Kimeldorf and Wahba. However, some intriguing observations can be
made. The penalized least squares estimator is well known to be a smoothing
spline and is known to achieve the minimax rate $n^{-k/(2k+1)}$ for regression
functions in $H^k[0,1]$ when the "smoothing parameter" $\lambda_n=\sigma^2/(nb)$ is set
to satisfy $\lambda_n\asymp n^{-2k/(2k+1)}$. This would yield a scaling factor $b\asymp n^{-1/(2k+1)}$,
meaning that the infinite-dimensional part of the prior would tend to zero.
In contrast, the preceding theorem shows that a fixed value of $b$ yields a
consistent posterior and a posterior achieving the optimal rate of contraction
if the smoothness $\beta$ of the true parameter is equal to the smoothness $k+1/2$
of the prior. (The theorem does not allow a diffuse prior on the polynomial
part, but it can be checked that the theorem remains true if the Gaussians
in this polynomial part have variance tending slowly to infinity.) It may be
noted that the preceding theorem appears to indicate that the prior works
best for functions in $H^{k+1/2}[0,1]$, not $H^k[0,1]$.

Wood and Kohn [34] implement the once-integrated Brownian motion
prior within the setting of the binary regression model, where a large variance
is used on the polynomial part.

4.2. Riemann-Liouville process. For $\alpha>0$ and $W$ a standard Brownian
motion, the Riemann-Liouville process with Hurst parameter $\alpha>0$ is defined
as

\[ R_t=\int_0^t(t-s)^{\alpha-1/2}\,dW_s,\qquad t\ge0. \]

The process $R$ is a centered Gaussian process with continuous sample paths.
It can be viewed as a multiple of the $(\alpha+1/2)$-fractional integral of the
"derivative $dW$ of Brownian motion." For $\alpha>0$ and a (deterministic)
measurable function $f$ on $[0,1]$, the (left-sided) Riemann-Liouville fractional
integral of $f$ of order $\alpha$ (if it exists) is defined as

\[ I_{0+}^\alpha f(t)=\frac1{\Gamma(\alpha)}\int_0^t(t-s)^{\alpha-1}f(s)\,ds. \]

For $\alpha$ a natural number, the function $I_{0+}^\alpha f$ is just the $\alpha$-fold iterated
integral of $f$ and for $\alpha>1/2$, the Riemann-Liouville process is equal to
$\Gamma(\alpha+1/2)\,I_{0+}^{\alpha-1/2}W$ for $I_{0+}^{\alpha-1/2}$ the fractional integral. It can be shown that
$I_{0+}^\alpha$ maps $\beta$-regular functions into $(\alpha+\beta)$-regular functions (if $\alpha+\beta$ is not
an integer; see [28]). Since Brownian motion is "regular of order 1/2," the
Riemann-Liouville process $R$ is a good model for "$\alpha$-regular functions." This
intuition is corroborated by the rate results in this section. For a proof of
the following theorem, see Examples 9 and 15 in [13].

Theorem 4.2. The RKHS of the Riemann-Liouville process with
parameter $\alpha>0$ viewed as a random element in $C[0,1]$ is $\mathbb{H}=I_{0+}^{\alpha+1/2}(L_2[0,1])$
and the RKHS-norm is given by

\[ \|I_{0+}^{\alpha+1/2}f\|_{\mathbb{H}}=\frac{\|f\|_2}{\Gamma(\alpha+1/2)}. \]

The Riemann-Liouville process is appropriate for approximating
$C^\alpha$-functions, except that its definition as an integral from 0 means that its
sample paths and their derivatives are tied down at zero. For $\alpha>0$ and $\underline{\alpha}$
the biggest integer smaller than $\alpha$, we shall instead consider the process

\[ X_t=\sum_{k=0}^{\underline{\alpha}+1}Z_kt^k+R_t^\alpha, \tag{4.1} \]

where $Z_0,\ldots,Z_{\underline{\alpha}+1},R^\alpha$ are independent, $Z_i$ is standard normal and $R^\alpha$ is
a Riemann-Liouville process with Hurst index $\alpha$. As before, we view this
process as a random element in $C[0,1]$.

Theorem 4.3. The support of the process $X$ is the whole space $C[0,1]$.
For any $w\in C^\alpha[0,1]$, the concentration function of $X$ satisfies
$\phi_w(\varepsilon)=O(\varepsilon^{-1/\alpha})$ as $\varepsilon\downarrow0$.

The proof of this theorem is deferred to Section 5. For $\alpha$ not an integer,
it can be seen by inspection of the proof that the theorem remains true if
$X$ is replaced by the process $\sum_{k=0}^{\underline{\alpha}}Z_kt^k+R_t^\alpha$.

For the concentration function $\phi_w(\varepsilon)=\varepsilon^{-1/\alpha}$, inequality (1.3) is solved by
$\varepsilon_n=n^{-\alpha/(2\alpha+1)}$. This is the minimax rate for estimating a function that is
known to be $\alpha$-regular in various nonparametric models. Combination of the
preceding theorem with the results on posterior contraction therefore shows
that the Gaussian prior (4.1) yields the optimal rate of convergence in various
settings. This is true, for instance, in the settings of density estimation using
a prior of the form $t\mapsto ce^{X_t}$ on the density, Gaussian regression using $X$ as
a prior regression function and classification using a prior $\Psi(X_t)$ on the
probability $\Pr(Y=1\mid X=t)$.

4.3. Fractional Brownian motion. Fractional Brownian motion offers a
different starting point for constructing a Gaussian process of a given
smoothness level. By definition, fractional Brownian motion (fBm) with Hurst
parameter $\alpha\in(0,1)$ is the zero-mean Gaussian process $X=(X_t:t\in[0,1])$ with
continuous sample paths and covariance function

\[ \mathrm{E}X_sX_t=\tfrac12\bigl(s^{2\alpha}+t^{2\alpha}-|t-s|^{2\alpha}\bigr). \]

The choice $\alpha=1/2$ yields ordinary Brownian motion. To obtain a process
of a given smoothness $\alpha>1$, we can take ordinary integrals of fractional
Brownian motion.
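To make the definition concrete, the following Python sketch evaluates the fBm covariance function and draws a sample path on a finite grid through a Cholesky factorization of the covariance matrix. This is a generic simulation device and not a construction used in the paper; all function names are hypothetical, and the grid is assumed to consist of distinct, strictly positive time points so that the matrix is positive definite.

```python
import math
import random

def fbm_cov(s, t, alpha):
    """Covariance E X_s X_t of fBm with Hurst parameter alpha in (0, 1)."""
    return 0.5 * (s ** (2 * alpha) + t ** (2 * alpha) - abs(t - s) ** (2 * alpha))

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return low

def sample_fbm(grid, alpha, rng):
    """Draw one fBm path on a grid of distinct, strictly positive time points."""
    cov = [[fbm_cov(s, t, alpha) for t in grid] for s in grid]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in grid]
    return [sum(low[i][m] * z[m] for m in range(i + 1)) for i in range(len(grid))]

if __name__ == "__main__":
    grid = [i / 20 for i in range(1, 21)]
    print(sample_fbm(grid, 0.3, random.Random(0)))
```

For $\alpha=1/2$ the function `fbm_cov` reduces to $\min(s,t)$, the covariance of ordinary Brownian motion.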

The conclusions using fractional Brownian motion are the same as for the 
Riemann-Liouville process. 

Theorem 4.4. Consider the fractional Brownian motion with Hurst
parameter $\alpha\in(0,1)$ as a random element in $C[0,1]$. For $w\in C^\alpha[0,1]$ with
$w(0)=0$, we have $\phi_w(\varepsilon)=O(\varepsilon^{-1/\alpha})$ as $\varepsilon\to0$.

Proof. For the fBm $X$, we have the representation

\[ X_t=c_\alpha\int_{-\infty}^{\infty}\bigl((t-s)_+^{\alpha-1/2}-(-s)_+^{\alpha-1/2}\bigr)\,dW_s, \]

where $W$ is a double-sided Wiener process and $c_\alpha$ a positive constant [25].
In other words, we have $X=c_\alpha R+c_\alpha Z$, where $R$ and $Z$ are independent
processes, $R$ is a RL-process with parameter $\alpha$ and $Z$ is defined by

\[ Z_t=\int_{-\infty}^0\bigl((t-s)^{\alpha-1/2}-(-s)^{\alpha-1/2}\bigr)\,dW_s. \]

By Lemma 3.2 of [23], $-\log\Pr(\|Z\|<\varepsilon)=o(\varepsilon^{-1/\alpha})$ as $\varepsilon\to0$, with $\|\cdot\|$ the
supremum norm, hence also with $\|\cdot\|$ the $L_2$-norm. The theorem therefore
follows from the results for the RL-process and Theorem 2.3. □

4.4. Truncated series. Any Gaussian variable in a separable Banach space
can be expanded as an infinite series $\sum_iZ_ih_i$ for i.i.d. standard normal
variables $Z_i$ and elements $h_i$ from its RKHS. By Theorem 2.2, the prior obtained
by truncating the series at a sufficiently high level will have the same
concentration function and will hence lead to the same posterior rate of contraction.
Because finite sums may be easier to handle, it is interesting to investigate
special expansions and the numbers of terms that need to be retained in
order to obtain the same contraction rate. In this section, we consider this
question for fractional Brownian motion.

By Theorem 4.4, the fBm with Hurst parameter $\alpha\in(0,1)$ as a prior for
a true signal which is Hölder continuous of order $\alpha$ leads to a
concentration function satisfying $\phi_{w_0}(\varepsilon)\lesssim\varepsilon^{-1/\alpha}$. Consequently, inequality (1.3) is
satisfied for $\varepsilon_n\asymp n^{-\alpha/(1+2\alpha)}$. This implies, for instance, that the rate of
posterior contraction in the white noise model (see Theorem 3.4) is equal to the
minimax rate relative to the $L_2$-norm.

By Theorem 2.2, for any series expansion $X=\sum_kZ_kh_k$ of the fBm $X$, the
truncated series $X^K=\sum_{k=1}^KZ_kh_k$ also gives the optimal rate of posterior
contraction if

\[ 10\,\mathrm{E}\|X^K-X\|_2^2\le\frac1n. \tag{4.2} \]

It is known that for any such expansion of the fBm, the truncated series $X^K$
satisfies (cf. [17])

\[ \mathrm{E}\|X-X^K\|_2^2\gtrsim\Bigl(\frac1K\Bigr)^{2\alpha}. \]

A given expansion is therefore called rate-optimal (for the $L_2$-norm) if
$\mathrm{E}\|X-X^K\|_2^2\lesssim K^{-2\alpha}$. Several explicit rate-optimal expansions of the fBm are known
(see, e.g., [2, 6, 7, 14]). For these rate-optimal expansions, (4.2) is fulfilled as
soon as the number of terms in the expansion satisfies $K=K_n\ge Cn^{1/(2\alpha)}$
for a large constant $C$. This is somewhat larger than the dimension $n^{1/(2\alpha+1)}$
found in the following section, which also arises in the usual bias-variance
trade-off of series estimators.
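The size of the required truncation level can be illustrated with a short Python sketch. It assumes, hypothetically, that the truncation error equals $cK^{-2\alpha}$ exactly for a known constant $c$ (the text only gives this up to constants), and returns the smallest $K$ satisfying (4.2) next to the bias-variance dimension $n^{1/(2\alpha+1)}$. The function names are illustrative.

```python
import math

def truncation_level(n, alpha, c=1.0):
    """Smallest K with 10 * c * K**(-2*alpha) <= 1/n.

    Assumes (hypothetically) that E||X - X^K||_2^2 = c * K**(-2*alpha) exactly.
    """
    k = max(1, math.floor((10.0 * c * n) ** (1.0 / (2.0 * alpha))))
    # Guard against floating point rounding in the closed-form guess.
    while 10.0 * c * k ** (-2.0 * alpha) > 1.0 / n:
        k += 1
    return k

def bias_variance_dimension(n, alpha):
    """The dimension n^(1/(2*alpha+1)) mentioned for comparison."""
    return n ** (1.0 / (2.0 * alpha + 1.0))

if __name__ == "__main__":
    for alpha in (0.25, 0.5, 0.75):
        print(alpha, truncation_level(10 ** 4, alpha), bias_variance_dimension(10 ** 4, alpha))
```

Under this assumption the truncation level grows like $n^{1/(2\alpha)}$, which exceeds $n^{1/(2\alpha+1)}$ for every $\alpha\in(0,1)$.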



4.5. Finite sums. Replacing the coefficients in a series expansion by 
Gaussian variables is a natural method to construct a Gaussian prior on 
a set of functions. In this section, we use truncated series and study the 
effect of varying the variances of the Gaussian variables. 

Series priors have been implemented in [20] in the density estimation 
model of Section 3.1 and in [21] in semiparametric regression, with Fourier- 
type series and coefficients with exponentially decreasing variances. The pri- 
ors are easy to implement in Gaussian regression. 

Because wavelet expansions give easy control of various norms, we
consider here expansions

\[ w=\sum_{j=1}^\infty\sum_{k=1}^{2^{jd}}w_{j,k}\psi_{j,k} \]

of functions $w:[0,1]^d\to\mathbb{R}$ on a double-indexed basis $\{\psi_{j,k}:j=1,2,\ldots,\ k=1,\ldots,2^{jd}\}$
of bounded functions $\psi_{j,k}:[0,1]^d\to\mathbb{R}$. (The unit cube could be
replaced by another compact subset of $\mathbb{R}^d$.) We consider these functions with
the norms

\begin{align*}
\|w\|_2&=\sum_{j=1}^\infty\Bigl(\sum_{1\le k\le2^{jd}}|w_{j,k}|^2\Bigr)^{1/2},\\
\|w\|_\infty&=\sum_{j=1}^\infty2^{jd/2}\max_{1\le k\le2^{jd}}|w_{j,k}|,\\
\|w\|_{\beta|\infty,\infty}&=\sup_{1\le j<\infty}2^{j\beta}2^{jd/2}\max_{1\le k\le2^{jd}}|w_{j,k}|.
\end{align*}

For the base functions $\psi_{j,k}$ derived from suitable orthonormal wavelets in
$L_2[0,1]^d$, these norms correspond to the $L_2$-norm, the supremum norm and
the Besov $(\beta,\infty,\infty)$-norm, respectively. The last norm measures smoothness
of order $\beta$, weaker than a Hölder norm of the same order.
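The supremum-type and Besov-type sequence norms are easy to compute from a finite table of coefficients. The following Python sketch is illustrative only: it assumes the coefficients are stored in a dictionary mapping the level $j$ to the list $(w_{j,1},\ldots,w_{j,2^{jd}})$, with all levels not present taken to be zero, and the function names are hypothetical.

```python
def sup_norm(coef, d):
    """Sum over levels j of 2^(jd/2) * max_k |w_{j,k}|."""
    return sum(2.0 ** (j * d / 2.0) * max(abs(x) for x in row)
               for j, row in coef.items())

def besov_norm(coef, d, beta):
    """Sup over levels j of 2^(j*beta) * 2^(jd/2) * max_k |w_{j,k}|."""
    return max(2.0 ** (j * beta) * 2.0 ** (j * d / 2.0) * max(abs(x) for x in row)
               for j, row in coef.items())

def tail_sup_norm(coef, d, J):
    """Supremum-type norm of the levels j > J, i.e. of w minus its level-J projection."""
    tail = {j: row for j, row in coef.items() if j > J}
    return sup_norm(tail, d) if tail else 0.0

if __name__ == "__main__":
    # Coefficients with 2^(j*beta) * 2^(jd/2) * |w_{j,k}| = 1 for d = 1, beta = 1.
    coef = {j: [2.0 ** (-j * 1.5)] * 2 ** j for j in range(1, 7)}
    print(besov_norm(coef, 1, 1.0), tail_sup_norm(coef, 1, 3))
```

By the definitions of the two norms, the tail beyond level $J$ is bounded by $\sum_{j>J}2^{-j\beta}$ times the Besov-type norm, which is the approximation property that makes the Besov-type norm a measure of smoothness.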

For given truncation levels $J_\alpha$, which will tend to infinity with $n$, we
consider a Gaussian prior of the type

\[ W=\sum_{j=1}^{J_\alpha}\sum_{k=1}^{2^{jd}}\mu_jZ_{j,k}\psi_{j,k}, \tag{4.3} \]

where the $\mu_j$ are positive numbers and the $Z_{j,k}$ are i.i.d. standard normal
variables. The number of terms in the random series is $O(2^{J_\alpha d})$. For a
transparent description of the main results, we set this number equal to the
integer closest to the solution $J_\alpha$ of the equation, for a given $\alpha>0$,

\[ 2^{J_\alpha d}=n^{d/(2\alpha+d)}. \]

This dimension is well known to be the optimal dimension of a
finite-dimensional model if the true parameter is known to be regular of order
$\alpha$. We next study the rate of posterior contraction if the true parameter is
$\beta$-regular under a variety of choices of the coefficients $\mu_j$ and for a general
$\beta>0$, which may be smaller or larger than the "nominal" value $\alpha$.

The contribution $W_j=\sum_{k=1}^{2^{jd}}\mu_jZ_{j,k}\psi_{j,k}$ of the $j$th level to the prior
satisfies

\[ \mathrm{E}\|W_j\|_2^2=\mu_j^22^{jd}. \]

Therefore, the choice $\mu_j=2^{-jd/2}$ gives all levels the same amount of prior
uncertainty. It is natural to choose the constants $\mu_j$ so that the numbers
$2^{jd/2}\mu_j$ are nonincreasing, but we shall allow these numbers to tend to zero
as $j\to\infty$. If $2^{jd/2}\mu_j\to0$, then the higher levels receive less weight and
hence the prior tends to be of lower dimension than the nominal dimension
$2^{J_\alpha d}$. This may be advantageous if the true parameter is of higher regularity
(i.e., $\beta>\alpha$), for which the optimal dimension $2^{J_\beta d}$ is indeed smaller. On
the other hand, if the true parameter is less regular (i.e., $\beta<\alpha$), then the
nominal dimension $2^{J_\alpha d}$ is already too small and this would be exacerbated
by putting lower weight on the higher levels. We shall show that the choice
$2^{jd/2}\mu_j=2^{-j\beta}$ is a good compromise: it yields the optimal rate of contraction
$n^{-\beta/(2\beta+d)}$ if $\beta\ge\alpha$ and the "optimal rate using a $2^{J_\alpha d}$-dimensional model"
$n^{-\beta/(2\alpha+d)}$ if $\beta\le\alpha$. The choice $2^{jd/2}\mu_j=1$, which gives equal weight to all
levels, is no worse than this if $\beta\le\alpha$, but yields only the rate $n^{-\alpha/(2\alpha+d)}$ for
$\beta\ge\alpha$.
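The truncation level and the level weights can be made concrete with a small Python sketch. It works with the coefficients $\mu_jZ_{j,k}$ only (the wavelets themselves are not constructed), parametrizes the weights as $2^{jd/2}\mu_j=2^{-ja}$ for a decay parameter $a$, and checks the identity for the expected squared size of a level by Monte Carlo. The function names, the seed and the number of replications are illustrative assumptions.

```python
import math
import random

def nominal_level(n, alpha, d):
    """Integer closest to the solution J of 2^(J d) = n^(d / (2 alpha + d))."""
    return round(math.log2(n) / (2.0 * alpha + d))

def level_scale(j, a, d):
    """mu_j chosen so that 2^(jd/2) * mu_j = 2^(-j a)."""
    return 2.0 ** (-j * a) * 2.0 ** (-j * d / 2.0)

def sample_coefficients(J, a, d, rng):
    """Coefficients mu_j * Z_{j,k} for levels j = 1..J, with Z_{j,k} i.i.d. N(0, 1)."""
    return {j: [level_scale(j, a, d) * rng.gauss(0.0, 1.0) for _ in range(2 ** (j * d))]
            for j in range(1, J + 1)}

def mean_level_energy(j, a, d, rng, reps=20000):
    """Monte Carlo estimate of E sum_k (mu_j Z_{j,k})^2, to compare with mu_j^2 2^(jd)."""
    mu = level_scale(j, a, d)
    total = 0.0
    for _ in range(reps):
        total += sum((mu * rng.gauss(0.0, 1.0)) ** 2 for _ in range(2 ** (j * d)))
    return total / reps

if __name__ == "__main__":
    rng = random.Random(1)
    print(nominal_level(2 ** 10, 1.0, 1))
    print(mean_level_energy(3, 0.5, 1, rng), level_scale(3, 0.5, 1) ** 2 * 2 ** 3)
```

With $a=0$ every level carries the same expected energy, while $a>0$ damps the higher levels geometrically, which is the mechanism discussed above.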

To be precise, in the following, we establish these rates up to logarithmic 
factors. The proof of the following theorem can be found in Section 5. 



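Theorem 4.5. Let $W$ be the Gaussian process given by (4.3) viewed as
a map in $\ell^\infty[0,1]^d$, with $\mu_j2^{jd/2}=2^{-ja}$ for some $a\ge0$. Let $w_0:[0,1]^d\to\mathbb{R}$
satisfy $\|w_0\|_{\beta|\infty,\infty}<\infty$. Then, for

\[ \varepsilon_n\gtrsim\begin{cases}
n^{-\beta/(2\alpha+d)}\log n, & \text{if } a\le\beta\le\alpha,\\
n^{-\alpha/(2\alpha+d)}\log n, & \text{if } a\le\alpha\le\beta,\\
n^{-a/(2a+d)}(\log n)^{d/(2a+d)}, & \text{if } \alpha\le a\le\beta,\\
n^{-\beta/(2a+d)}(\log n)^{d/(2a+d)}, & \text{if } \alpha\le\beta\le a,
\end{cases} \]

there exists a measurable set $B_n\subset\ell^\infty[0,1]^d$ such that

\begin{align*}
\log N(3\varepsilon_n,B_n,\|\cdot\|_\infty)&\le6Cn\varepsilon_n^2, \tag{4.4}\\
\Pr(W\notin B_n)&\le e^{-Cn\varepsilon_n^2}, \tag{4.5}\\
\Pr(\|W-w_0\|_\infty<4\varepsilon_n)&\ge e^{-n\varepsilon_n^2}. \tag{4.6}
\end{align*}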

5. Proofs. 

Proof of Theorem 2.1. Inequality (2.4) is an immediate consequence
of (1.3) and (4.16) of [16]. We need to prove existence of the sets $B_n$ such
that the first and second inequalities in the theorem hold.

For $\mathbb{B}_1$ and $\mathbb{H}_1$ the unit balls in the Banach space $\mathbb{B}$ and the RKHS $\mathbb{H}$,
respectively, and $M_n$ a positive constant, set

\[ B_n=\varepsilon_n\mathbb{B}_1+M_n\mathbb{H}_1. \]

By Borell's inequality ([3], Theorem 3.1), it follows that

\[ \Pr(W\notin B_n)\le1-\Phi(\alpha_n+M_n) \]

for $\Phi$ the distribution function of the standard normal distribution and $\alpha_n$
determined by

\[ \Phi(\alpha_n)=\Pr(W\in\varepsilon_n\mathbb{B}_1)=e^{-\phi_0(\varepsilon_n)}. \]

For $C>1$, set

\[ M_n=-2\Phi^{-1}(e^{-Cn\varepsilon_n^2}). \]

Because $\phi_0(\varepsilon_n)\le\phi_{w_0}(\varepsilon_n)\le n\varepsilon_n^2$ by assumption (1.3), and $C>1$, we have
that $\alpha_n\ge-\frac12M_n$, whence $\alpha_n+M_n\ge\frac12M_n$ and

\[ \Pr(W\notin B_n)\le1-\Phi(\tfrac12M_n)=e^{-Cn\varepsilon_n^2}. \]




We conclude that inequality (2.3) is satisfied. 

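If $h_1,\ldots,h_N$ are contained in $M_n\mathbb{H}_1$ and are $2\varepsilon_n$-separated for the norm
$\|\cdot\|$, then the $\|\cdot\|$-balls $h_j+\varepsilon_n\mathbb{B}_1$ of radius $\varepsilon_n$ around these points are
disjoint and hence

\begin{align*}
1&\ge\sum_{j=1}^N\Pr(W\in h_j+\varepsilon_n\mathbb{B}_1)\\
&\ge\sum_{j=1}^Ne^{-(1/2)\|h_j\|_{\mathbb{H}}^2}\Pr(W\in\varepsilon_n\mathbb{B}_1)\\
&\ge Ne^{-(1/2)M_n^2}e^{-\phi_0(\varepsilon_n)},
\end{align*}

where the second inequality follows from (4.16) of [16]. If the $2\varepsilon_n$-net $h_1,\ldots,h_N$
is maximal in the set $M_n\mathbb{H}_1$, then the balls $h_j+2\varepsilon_n\mathbb{B}_1$ cover $M_n\mathbb{H}_1$. It follows
that

\[ N(2\varepsilon_n,M_n\mathbb{H}_1,\|\cdot\|)\le N\le e^{(1/2)M_n^2}e^{\phi_0(\varepsilon_n)}. \]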

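By its definition, any point of the set $B_n$ is within distance $\varepsilon_n$ of some point
of $M_n\mathbb{H}_1$. This implies that

\begin{align*}
\log N(3\varepsilon_n,B_n,\|\cdot\|)&\le\log N(2\varepsilon_n,M_n\mathbb{H}_1,\|\cdot\|)\\
&\le\tfrac12M_n^2+\phi_0(\varepsilon_n)\\
&\le5Cn\varepsilon_n^2+\phi_0(\varepsilon_n),
\end{align*}

by the definition of $M_n$ if $e^{-Cn\varepsilon_n^2}<1/2$, because $\Phi^{-1}(y)\ge-\sqrt{(5/2)\log(1/y)}$
and is negative for every $y\in(0,1/2)$. Since $\phi_0(\varepsilon_n)\le\phi_{w_0}(\varepsilon_n)\le n\varepsilon_n^2$, this
concludes the verification of (2.2). □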
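The quantile bound used in the last step can be checked numerically. The following Python sketch relies on `statistics.NormalDist` from the standard library; the function names and the particular values of $C$, $n$ and $\varepsilon_n$ in the example are arbitrary illustrations and are not taken from the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def m_n(C, n, eps):
    """M_n = -2 * Phi^{-1}(exp(-C n eps^2)), as in the proof of Theorem 2.1."""
    return -2.0 * STD_NORMAL.inv_cdf(math.exp(-C * n * eps * eps))

def quantile_bound_holds(y):
    """Check Phi^{-1}(y) >= -sqrt((5/2) log(1/y)) for a given y in (0, 1/2)."""
    return STD_NORMAL.inv_cdf(y) >= -math.sqrt(2.5 * math.log(1.0 / y))

if __name__ == "__main__":
    print(all(quantile_bound_holds(10.0 ** (-k)) for k in range(1, 13)))
    M = m_n(2.0, 100, 0.3)
    # (1/2) M_n^2 should not exceed 5 C n eps^2.
    print(0.5 * M * M, 5 * 2.0 * 100 * 0.3 ** 2)
```

Squaring the quantile bound gives $M_n^2\le10Cn\varepsilon_n^2$, which is the source of the constant $5C$ in the display.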

Proof of Theorem 2.2. As a consequence of Borell's inequality (cf.
[32], Proposition A.2.1), we have

\[ \Pr(\|W^n-W\|\ge\varepsilon_n)\le2e^{-\varepsilon_n^2/(8\mathrm{E}\|W^n-W\|^2)}\le2e^{-n\varepsilon_n^2/(8/10)}, \]

since $\mathrm{E}\|W^n-W\|^2\le1/(10n)$, by assumption. For $\varepsilon_n$ satisfying $n\varepsilon_n^2\ge4\log4$,
the right-hand side is bounded above by $\frac12e^{-n\varepsilon_n^2}$. Because
$\Pr(\|W^n-w_0\|<3\varepsilon_n)\ge\Pr(\|W-w_0\|<2\varepsilon_n)-\Pr(\|W^n-W\|\ge\varepsilon_n)$, it follows from (1.3) that

\[ \Pr(\|W^n-w_0\|<3\varepsilon_n)\ge\tfrac12e^{-n\varepsilon_n^2}\ge e^{-4n\varepsilon_n^2}. \]

This completes the verification of (2.4) with $\varepsilon_n$ replaced by $2\varepsilon_n$.
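The threshold $4\log4$ comes from a one-line computation, spelled out here for convenience (a worked step, using only the quantities in the proof above):

```latex
\[
2e^{-(10/8)\,n\varepsilon_n^2} \le \tfrac12 e^{-n\varepsilon_n^2}
\iff e^{-n\varepsilon_n^2/4} \le \tfrac14
\iff n\varepsilon_n^2 \ge 4\log 4 .
\]
```

Similarly, $\tfrac12e^{-n\varepsilon_n^2}\ge e^{-4n\varepsilon_n^2}$ holds as soon as $3n\varepsilon_n^2\ge\log2$, which is implied by the same condition.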

We choose $B_n=2\varepsilon_n\mathbb{B}_1+M_n\mathbb{H}_1^n$ for $\mathbb{H}_1^n$ the unit ball of the RKHS
associated to $W^n$, and $M_n=-2\Phi^{-1}(e^{-Cn\varepsilon_n^2})$, as in the proof of Theorem 2.1.
Similarly to the observation in the preceding paragraph, we have

\[ e^{-\phi_0^n(2\varepsilon_n)}:=\Pr(\|W^n\|<2\varepsilon_n)\ge\tfrac12e^{-n\varepsilon_n^2}\ge e^{-4n\varepsilon_n^2}. \]

The verification of (2.3)-(2.4) with $2\varepsilon_n$ instead of $\varepsilon_n$ can now proceed exactly
as in the proof of Theorem 2.1. For the first, we use that $C>4$, so that again
$\alpha_n=\Phi^{-1}(e^{-\phi_0^n(2\varepsilon_n)})\ge\Phi^{-1}(e^{-4n\varepsilon_n^2})\ge-\frac12M_n$. For the second, we substitute
the inequality $\phi_0^n(2\varepsilon_n)\le4n\varepsilon_n^2$. □

Proof of Theorem 2.3. If $\|W^i-w^i\|<\varepsilon$ for every $i$, then $\|W-w\|<\varepsilon|I|$,
where $|I|$ is the cardinality of $I$. Combined with the independence of the
processes $W^i$ this implies that $\Pr(\|W-w\|<\varepsilon|I|)\ge\prod_i\Pr(\|W^i-w^i\|<\varepsilon)$.
In view of Theorem 2 of [16], the concentration function $\phi_w(\varepsilon)$ of $W$ is
bounded above by twice the negative logarithm of the left-hand side, which
is bounded above by $2\sum_i\phi_{w^i}^i(\varepsilon)$, again by Theorem 2 of [16]. □

Proof of Theorem 2.4. It is easy to see that the RKHS $\mathbb{H}^a$ of the
process $aW$ for a fixed value of $a$ is equal to the RKHS $\mathbb{H}$ of $W$, but with
norm $\|h\|_{\mathbb{H}^a}=a^{-1}\|h\|_{\mathbb{H}}$. We define $B_n=K\varepsilon_n\mathbb{B}_1+KM_n\mathbb{H}_1=KB_n^1$, for $B_n^1$
the set $B_n$ appearing in Theorem 2.1. Because $A\le K$ and $B_n$ is a cone, it
is clear that $\Pr(AW\notin B_n)\le\Pr(W\notin B_n^1)\le e^{-Cn\varepsilon_n^2}$, by Theorem 2.1. Also,
$\log N(3K\varepsilon_n,B_n,\|\cdot\|)\le\log N(3\varepsilon_n,B_n^1,\|\cdot\|)\le6Cn\varepsilon_n^2$, again by Theorem 2.1.

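By Theorem 2 of [16], for any fixed $a$ and $\varepsilon>0$,

\begin{align*}
-\log\Pr(\|aW-w_0\|<2\varepsilon)
&\le\inf\{\|h\|_{\mathbb{H}^a}^2:\|h-w_0\|<\varepsilon\}-\log\Pr(\|aW\|<\varepsilon)\\
&\le\frac1{\kappa^2}\inf\{\|h\|_{\mathbb{H}}^2:\|h-w_0\|<\varepsilon\}-\log\Pr(\|W\|<\varepsilon/K)\\
&\le\frac1{\kappa^2}\phi_{w_0}(\varepsilon/K)
\end{align*}

for $a\ge\kappa$ and $0<\kappa\le1\le K$. We apply this with $\varepsilon/K=\varepsilon_n$ and then apply
(1.3) to arrive at (2.7). □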

For the proof of Theorem 4.3, we first recall some facts from fractional 
calculus, which can be found in [28]. 

Using Fubini's theorem, it can be seen that the fractional integration
operators have the semigroup property $I_{0+}^\alpha I_{0+}^\beta=I_{0+}^{\alpha+\beta}$. The fractional
integration operator acts on power functions as one would expect: for $\alpha>0$,
$\beta>-1$ and $f(t)=t^\beta$,

\[ I_{0+}^\alpha f(t)=\frac{\Gamma(\beta+1)}{\Gamma(\alpha+\beta+1)}t^{\alpha+\beta}. \]
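The action of the fractional integral on power functions, $I_{0+}^\alpha t^\beta=\Gamma(\beta+1)t^{\alpha+\beta}/\Gamma(\alpha+\beta+1)$, can be verified numerically. The following Python sketch is one possible quadrature, not a method from the paper: it uses the substitution $s=t(1-u^{1/\alpha})$, which removes the endpoint singularity of $(t-s)^{\alpha-1}$, followed by the midpoint rule. The function names and the number of quadrature steps are illustrative choices.

```python
import math

def rl_integral(f, alpha, t, steps=4000):
    """Riemann-Liouville integral (1/Gamma(alpha)) * int_0^t (t-s)^(alpha-1) f(s) ds.

    The substitution s = t * (1 - u**(1/alpha)) turns the integral into
    t^alpha / Gamma(alpha + 1) * int_0^1 f(s(u)) du, which is approximated
    here by the midpoint rule.
    """
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += f(t * (1.0 - u ** (1.0 / alpha)))
    return t ** alpha / math.gamma(alpha + 1.0) * total * h

def power_rule(alpha, beta, t):
    """Closed form Gamma(beta+1) / Gamma(alpha+beta+1) * t^(alpha+beta)."""
    return math.gamma(beta + 1.0) / math.gamma(alpha + beta + 1.0) * t ** (alpha + beta)

if __name__ == "__main__":
    print(rl_integral(lambda s: s, 0.5, 0.8), power_rule(0.5, 1.0, 0.8))
    print(rl_integral(lambda s: s * s, 1.5, 1.0), power_rule(1.5, 2.0, 1.0))
```

The two printed pairs should agree to several decimal places; the same routine can be applied twice to check the semigroup property on smooth test functions.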

For $\alpha\in(0,1)$, the (left-sided) Riemann-Liouville fractional derivative of $f$
of order $\alpha$ is defined by

\[ D_{0+}^\alpha f(t)=\frac1{\Gamma(1-\alpha)}\frac{d}{dt}\int_0^t(t-s)^{-\alpha}f(s)\,ds=\frac{d}{dt}I_{0+}^{1-\alpha}f(t), \]

provided it exists. To define the fractional derivative for $\alpha\ge1$, we
introduce the notation $[\alpha]$ and $\{\alpha\}$ for the integer and fractional parts of $\alpha$,
respectively. For general $\alpha>0$, we define

\[ D_{0+}^\alpha f=\Bigl(\frac{d}{dt}\Bigr)^{[\alpha]}D_{0+}^{\{\alpha\}}f. \]

In particular, $D_{0+}^\alpha f$ is just the $\alpha$th derivative of $f$ if $\alpha$ is an integer. Observe
that $D_{0+}^\alpha f$ equals the $n$th derivative of $I_{0+}^{n-\alpha}f$, provided it exists, with
$n=[\alpha]+1$. We say that $f$ has a summable fractional derivative $D_{0+}^\alpha f$ if $I_{0+}^{n-\alpha}f$ has
$n-1$ continuous derivatives and the $(n-1)$th derivative is only absolutely
continuous rather than differentiable.

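Fractional integration and differentiation are inverse operations, in the
sense that $D_{0+}^\alpha I_{0+}^\alpha=\mathrm{Id}$. However, in general, $D_{0+}^\alpha$ is not the right inverse
of $I_{0+}^\alpha$. If $f\in L_1$ has a summable derivative of order $\alpha>0$, then, with
$n=[\alpha]+1$,

\[ I_{0+}^\alpha D_{0+}^\alpha f(t)=f(t)-\sum_{k=0}^{n-1}\frac{D^{n-k-1}(I_{0+}^{n-\alpha}f)(0)}{\Gamma(\alpha-k)}t^{\alpha-k-1}. \]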

Lemma 5.1. Suppose that $f$ is twice continuously differentiable and
$f(0)=0$. For $\alpha\in(1,2)$, the function $f$ has a summable fractional derivative
$D_{0+}^\alpha f$ and can be written as $f=I_{0+}^\alpha D_{0+}^\alpha f$. Furthermore,

\[ D_{0+}^\alpha f(t)=\frac{f'(0)}{\Gamma(2-\alpha)}t^{1-\alpha}+I_{0+}^{2-\alpha}f''(t). \]

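Proof. Since $f(0)=0$, we have $f(t)=f'(0)t+I_{0+}^2f''(t)$, whence

\[ I_{0+}^{2-\alpha}f(t)=\frac{f'(0)}{\Gamma(4-\alpha)}t^{3-\alpha}+I_{0+}^{4-\alpha}f''(t). \]

Differentiating this twice and using the identity $\Gamma(1+x)=x\Gamma(x)$ yields the
expression for $D_{0+}^\alpha f$. The formula preceding the lemma gives

\[ f(t)=I_{0+}^\alpha D_{0+}^\alpha f(t)+\sum_{k=0}^{1}\frac{D^{1-k}(I_{0+}^{2-\alpha}f)(0)}{\Gamma(\alpha-k)}t^{\alpha-k-1}. \]

For $k=1$, we get $I_{0+}^{2-\alpha}f(0)$ in the numerator, which vanishes since $f$ is
continuous. For $k=0$, we get $D_{0+}^{\alpha-1}f(0)$. But since $f(0)=0$, we have
$f=I_{0+}^1f'$ and hence $D_{0+}^{\alpha-1}f(0)=I_{0+}^{2-\alpha}f'(0)=0$. □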

Roughly speaking, fractional integration of order $\alpha$ improves the
smoothness of a function by $\alpha$. More precisely, for $\lambda\in[0,1]$ and $\alpha\in(0,1)$ such
that $\lambda+\alpha\ne1$, it holds that $I_{0+}^\alpha:C_0^\lambda[0,1]\to C^{\alpha+\lambda}[0,1]$, where $C_0^\lambda[0,1]$ are
the functions $f\in C^\lambda[0,1]$ with $f(0)=0$. The analogous assertion is true for
the fractional derivative. If $0<\alpha<\lambda\le1$, then $C_0^\lambda[0,1]\subset I_{0+}^\alpha(L_1[0,1])$. On
the space $I_{0+}^\alpha(L_1[0,1])$, the Riemann-Liouville fractional derivative $D_{0+}^\alpha$
coincides with the so-called Marchaud fractional derivative of order $\alpha$. The
latter maps $C_0^\lambda[0,1]$ into $C^{\lambda-\alpha}[0,1]$. Hence, for $0<\alpha<\lambda\le1$, we have
$D_{0+}^\alpha:C_0^\lambda[0,1]\to C^{\lambda-\alpha}[0,1]$.

In the following lemma, we use the customary notation $I_{0+}^\alpha=D_{0+}^{-\alpha}$ for
$\alpha<0$.

Lemma 5.2. Let $\lambda\in[0,1]$ and $\alpha\in[0,1)$ be such that $\alpha+\lambda\in(0,2)$ and
$\alpha+\lambda\ne1$. If $f\in C^\lambda[0,1]$ and $g\in L_1(\mathbb{R})$ has compact support and satisfies
$\int g(u)\,du=0$ and, in the case that $\alpha+\lambda>1$, also $\int ug(u)\,du=0$, then

\[ \|I_{0+}^\alpha(f*g)\|_\infty\lesssim\int|u|^{\alpha+\lambda}|g(u)|\,du. \]

Proof. The conditions on $g$ imply that

\[ (f*g)(s)=\int(f(s-u)-f(s))g(u)\,du\qquad\text{for } s\in(0,1) \]

and we may assume that $f(0)=0$. A change of variables shows that for
$u\in\mathbb{R}$ and $t\in(0,1)$, we have

\[ \frac1{\Gamma(\alpha)}\int_0^t(t-s)^{\alpha-1}f(s-u)\,ds=I_{0+}^\alpha f(t-u), \]

the right-hand side vanishing by definition if $u>t$. Using the fact that $g$ has
compact support to justify the interchanging of integrals, it follows that

\[ I_{0+}^\alpha(f*g)(t)=\int\bigl((I_{0+}^\alpha f)(t-u)-(I_{0+}^\alpha f)(t)\bigr)g(u)\,du. \tag{5.1} \]

Because $I_{0+}^\alpha:C_0^\lambda[0,1]\to C^{\alpha+\lambda}[0,1]$, we have, for $\alpha+\lambda>1$,

\[ \bigl|(I_{0+}^\alpha f)(t-u)-(I_{0+}^\alpha f)(t)+u(I_{0+}^\alpha f)'(t)\bigr|\lesssim|u|^{\alpha+\lambda}. \]

Inserting this in the preceding display completes the proof in this case. If
$\alpha+\lambda<1$, then the preceding display is satisfied with the factor $u(I_{0+}^\alpha f)'(t)$
omitted and the proof is completed as before. □

Proof of Theorem 4.3. Let $Z=X-R^\alpha$ be the polynomial part of
$X$ given in (4.1).

By Theorem 2.1 of [23], $-\log\Pr(\|R^\alpha\|_\infty<\varepsilon)$ behaves as a constant times
$\varepsilon^{-1/\alpha}$ as $\varepsilon\to0$. Because each of the probabilities $\Pr(\|Z_it^i\|_\infty<\varepsilon)$ behaves
as a constant times $\varepsilon$ as $\varepsilon\to0$, $-\log\Pr(\|Z\|_\infty<\varepsilon)$ is bounded above by a
constant times $\log(1/\varepsilon)$, which is much smaller than $\varepsilon^{-1/\alpha}$.

In view of Theorem 2.3, the concentration function $\phi_w(2\varepsilon)$ of $X$ is bounded
by a multiple of the sum $\phi_{w-P}(2\varepsilon;R^\alpha)+\phi_P(2\varepsilon;Z)$ of the concentration
functions of $R^\alpha$ and $Z$, where $w=w-P+P$ may be an arbitrary split. The
RKHS of the process $Z$ is the set of polynomials $P_\xi=\sum_{i=0}^{\underline{\alpha}+1}\xi_it^i$ with square
norm $\|P_\xi\|_{\mathbb{H}}^2=\sum_{i=0}^{\underline{\alpha}+1}\xi_i^2$. Therefore, for any such polynomial,

\[ \phi_{P_\xi}(\varepsilon;Z)\lesssim\sum_{i=0}^{\underline{\alpha}+1}\xi_i^2+\log(1/\varepsilon). \]

We shall apply this with polynomials such that $\phi_{w-P}(2\varepsilon;R^\alpha)$ becomes
suitably small.

Let $\phi$ be a smooth, compactly supported, order-$(\alpha\vee2)$ kernel and, for
$\sigma>0$, define $\phi_\sigma(t)=\sigma^{-1}\phi(t/\sigma)$. We note that, automatically, $\int\phi'(t)\,dt=
\int\phi''(t)\,dt=\int t\phi''(t)\,dt=0$. Since $w\in C^\alpha$, we have $\|w-w*\phi_\sigma\|_\infty\lesssim\sigma^\alpha$,
whence $\|w-w*\phi_\sigma\|_\infty<\varepsilon$ if $\sigma=C\varepsilon^{1/\alpha}$ for an appropriate constant $C$.

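Let $\gamma=\{\alpha\}\in(0,1]$ be the fractional part of $\alpha$. We first consider the case
$\gamma\in(0,1/2]$. By Taylor's theorem,

\begin{align*}
w*\phi_\sigma(t)&=\sum_{k=0}^{\underline{\alpha}}\frac{(w*\phi_\sigma)^{(k)}(0)}{k!}t^k+I_{0+}^{\underline{\alpha}+1}(w^{(\underline{\alpha})}*\phi_\sigma')(t)\\
&=\sum_{k=0}^{\underline{\alpha}}\frac{(w*\phi_\sigma)^{(k)}(0)}{k!}t^k+I_{0+}^{\alpha+1/2}I_{0+}^{1/2-\gamma}(w^{(\underline{\alpha})}*\phi_\sigma')(t),
\end{align*}

by the semigroup property of fractional integrals. The first function on the
right is a polynomial $P_\sigma$ and the sum of squares of its coefficients can be
seen to be bounded for small $\sigma$. By Theorem 4.2, the second function on
the right belongs to the RKHS of $R^\alpha$ and has squared RKHS-norm equal to
a multiple of $\|I_{0+}^{1/2-\gamma}(w^{(\underline{\alpha})}*\phi_\sigma')\|_2^2$, which is $O(\sigma^{-1})=O(\varepsilon^{-1/\alpha})$, by
Lemma 5.2. We now split $w=w-P_\sigma+P_\sigma$ and approximate $w-P_\sigma$ with the
function $w*\phi_\sigma-P_\sigma$ in the RKHS of $R^\alpha$.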

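In the case where $\gamma\in(1/2,1]$, we apply Lemma 5.1, with the $\alpha$ and
function $f$ in the lemma taken equal to the present $\gamma+1/2$ and
$I_{0+}^1(w^{(\underline{\alpha})}*\phi_\sigma')$, to obtain that

\[ I_{0+}^1(w^{(\underline{\alpha})}*\phi_\sigma')=I_{0+}^{\gamma+1/2}(g_1+g_2), \]

with

\begin{align*}
g_1(t)&=\frac{(w^{(\underline{\alpha})}*\phi_\sigma')(0)}{\Gamma(3/2-\gamma)}t^{1/2-\gamma},\\
g_2(t)&=I_{0+}^{3/2-\gamma}(w^{(\underline{\alpha})}*\phi_\sigma'')(t).
\end{align*}

By integrating the penultimate display $\underline{\alpha}$ times, we obtain
$I_{0+}^{\underline{\alpha}+1}(w^{(\underline{\alpha})}*\phi_\sigma')=I_{0+}^{\alpha+1/2}(g_1+g_2)$ and hence, by Taylor's theorem,

\[ w*\phi_\sigma(t)=\sum_{k=0}^{\underline{\alpha}}\frac{(w*\phi_\sigma)^{(k)}(0)}{k!}t^k+I_{0+}^{\alpha+1/2}g_1(t)+I_{0+}^{\alpha+1/2}g_2(t). \]

Since $g_2$ is square integrable, the third term on the right belongs to the
RKHS of $R^\alpha$, with squared RKHS-norm equal to a multiple of
$\|g_2\|_2^2\le\|I_{0+}^{3/2-\gamma}(w^{(\underline{\alpha})}*\phi_\sigma'')\|_\infty^2$. This is $O(\sigma^{-1})$, by Lemma 5.2. The sum of the
first two terms is a polynomial of degree $\underline{\alpha}+1$ and the sum of squares of its
coefficients is bounded by a constant times

\[ \sum_{k=0}^{\underline{\alpha}}\bigl((w^{(k)}*\phi_\sigma)(0)\bigr)^2+\bigl((w^{(\underline{\alpha})}*\phi_\sigma')(0)\bigr)^2. \]

The first term is bounded, while the second term is of order $\sigma^{2\gamma-2}$, which is
$O(\sigma^{-1})$, since $\gamma>1/2$. □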



Proof of Theorem 4.5. The index $k$, when nested within a sum over
$j$ in the following, is to be understood to range over all possible values
$1,2,\ldots,2^{jd}$. The reproducing kernel Hilbert space of the variable $W$ is the
set of functions $w=\sum_{j=1}^{J_\alpha}\sum_kw_{j,k}\psi_{j,k}$ with

\[ \|w\|_{\mathbb{H}}^2=\sum_{j=1}^{J_\alpha}\sum_k\frac{w_{j,k}^2}{\mu_j^2}<\infty. \]

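For a fixed integer $J\le J_\alpha$ (to be determined later), the projection
$w_0^J=\sum_{j=1}^J\sum_kw_{0;j,k}\psi_{j,k}$ is clearly contained in the RKHS, whence, for any $\varepsilon>0$,

\[ \inf\{\|w\|_{\mathbb{H}}^2:\|w-w_0^J\|_\infty<\varepsilon\}\le\|w_0^J\|_{\mathbb{H}}^2=\sum_{j=1}^J\sum_k\frac{w_{0;j,k}^2}{\mu_j^2}\le\sum_{j=1}^J2^{j(2a-2\beta+d)}\|w_0\|_{\beta|\infty,\infty}^2 \]

for $\mu_j2^{jd/2}=2^{-ja}$. For any numbers $\alpha_j>0$ with $\sum_{j=1}^{J_\alpha}\alpha_j\le1$, we have

\begin{align*}
\Pr(\|W\|_\infty<\varepsilon)&=\Pr\Bigl(\sum_{j=1}^{J_\alpha}2^{jd/2}\max_k|\mu_jZ_{j,k}|<\varepsilon\Bigr)\\
&\ge\prod_{j=1}^{J_\alpha}\prod_k\Pr\bigl(|\mu_j2^{jd/2}Z_{j,k}|<\alpha_j\varepsilon\bigr).
\end{align*}

Therefore, for $\mu_j2^{jd/2}=2^{-ja}$ and $\alpha_j=(K+d^2j^2)^{-1}$ and a large constant
$K$, it follows that

\begin{align*}
-\log\Pr(\|W\|_\infty<\varepsilon_n)&\le-\sum_{j=1}^{J_\alpha}2^{jd}\log\bigl(2\Phi(\alpha_j\varepsilon_n2^{ja})-1\bigr)\\
&\lesssim\int_1^{2^{J_\alpha d}}f\Bigl(\frac{\varepsilon_nx^{a/d}}{K+\log_2^2x}\Bigr)\,dx,
\end{align*}

with $f(y)=-\log(2\Phi(y)-1)$. To justify the last step, we may choose the constant
$K$ sufficiently large that the function $x\mapsto x^{a/d}/(K+\log_2^2x)$ is nondecreasing on $[1,\infty)$.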
The function $f:[0,\infty)\to\mathbb{R}$ defined by $f(y)=-\log(2\Phi(y)-1)$ is decreasing
from $\infty$ at $y=0$ to 0 at $y=\infty$. It is bounded above by a multiple of
$1+|\log y|$ for $y$ in an interval $[0,c]$ and bounded above by a multiple of $e^{-y^2/2}$
for $y\ge c$. [For the latter note that $f'(y)=-2\phi(y)/(2\Phi(y)-1)$ is bounded
above in absolute value by $2\phi(y)/(2\Phi(c)-1)$ for $y\ge c$ so that
$f(y)=f(\infty)-\int_y^\infty f'(x)\,dx$ is bounded in absolute value by $2(1-\Phi(y))/(2\Phi(c)-1)$
on this interval.]
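The stated bounds on $f(y)=-\log(2\Phi(y)-1)$ can be checked numerically. The Python sketch below uses the identities $2\Phi(y)-1=\operatorname{erf}(y/\sqrt2)$ and $2(1-\Phi(y))=\operatorname{erfc}(y/\sqrt2)$ for a numerically stable evaluation; the function names and the choice $c=1$ in the example are illustrative assumptions.

```python
import math

def f(y):
    """f(y) = -log(2*Phi(y) - 1) for y > 0, via 2*Phi(y) - 1 = 1 - erfc(y / sqrt(2))."""
    return -math.log1p(-math.erfc(y / math.sqrt(2.0)))

def tail_bound(y, c):
    """Bound 2*(1 - Phi(y)) / (2*Phi(c) - 1), claimed for y >= c > 0."""
    return math.erfc(y / math.sqrt(2.0)) / math.erf(c / math.sqrt(2.0))

if __name__ == "__main__":
    c = 1.0
    for y in (1.0, 2.0, 4.0, 8.0):
        # f(y), its tail bound, and the Gaussian envelope exp(-y^2/2) / (2*Phi(c) - 1).
        print(y, f(y), tail_bound(y, c), math.exp(-y * y / 2.0) / math.erf(c / math.sqrt(2.0)))
    # Near zero, f grows only logarithmically.
    print(f(1e-3), 1.0 + abs(math.log(1e-3)))
```

Since $\operatorname{erfc}(x)\le e^{-x^2}$ for $x\ge0$, the tail bound is itself at most a multiple of $e^{-y^2/2}$, in line with the claim in the text.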

We consider two cases to further bound the integral in the last display. For
$\varepsilon_n2^{J_\alpha a}\le K+J_\alpha^2d^2$, the argument $(\varepsilon_nx^{a/d})/(K+\log_2^2x)$ is bounded above
by a constant on the integration interval $[1,2^{J_\alpha d}]$ and hence the function $f$
in the integral can be bounded above by a multiple of $1+|\log y|$ of its
argument $y$, yielding as upper bound a multiple of

\[ \int_1^{2^{J_\alpha d}}\Bigl(1+\Bigl|\log\frac{\varepsilon_nx^{a/d}}{K+\log_2^2x}\Bigr|\Bigr)\,dx\lesssim2^{J_\alpha d}\bigl(\log(1/\varepsilon_n)+J_\alpha\bigr). \]

Whenever $a>0$ and, in particular, if $\varepsilon_n2^{J_\alpha a}>K+J_\alpha^2d^2$, we can change
variables $\varepsilon_nx^{a/d}=y$ and rewrite the integral as

\[ \Bigl(\frac1{\varepsilon_n}\Bigr)^{d/a}\int_{\varepsilon_n}^{\varepsilon_n2^{J_\alpha a}}f\Bigl(\frac{y}{K+(d/a)^2(\log_2y+\log_2(1/\varepsilon_n))^2}\Bigr)\frac{d}{a}y^{d/a-1}\,dy. \]

The integral in this expression is bounded above by

\begin{align*}
&\int_0^{1/\varepsilon_n}f\Bigl(\frac{y}{K+(2d/a)^2\log_2^2(1/\varepsilon_n)}\Bigr)\frac{d}{a}y^{d/a-1}\,dy
+\int_{1/\varepsilon_n}^\infty f\Bigl(\frac{y}{K+(2d/a)^2\log_2^2y}\Bigr)\frac{d}{a}y^{d/a-1}\,dy\\
&\qquad\le\mu_n^{d/a}\int_0^\infty f(x)\frac{d}{a}x^{d/a-1}\,dx
+\int_0^\infty f\Bigl(\frac{y}{K+(2d/a)^2\log_2^2y}\Bigr)\frac{d}{a}y^{d/a-1}\,dy
\end{align*}

for $\mu_n=K+(2d/a)^2\log_2^2(1/\varepsilon_n)$. The integral in the first term on the right is
bounded as $n\to\infty$, whence the whole expression is bounded by a multiple
of $(\log(1/\varepsilon_n))^{2d/a}$.

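Combining the preceding, we conclude that

\begin{align*}
\phi_{w_0^J}(\varepsilon_n)&=\inf\{\|w\|_{\mathbb{H}}^2:\|w-w_0^J\|_\infty<\varepsilon_n\}-\log\Pr(\|W\|_\infty<\varepsilon_n)\\
&\lesssim\sum_{j=1}^J2^{j(2a-2\beta+d)}+\begin{cases}
2^{J_\alpha d}\bigl(\log(1/\varepsilon_n)+J_\alpha\bigr), & \text{if } \varepsilon_n2^{J_\alpha a}\le J_\alpha^2,\\
(1/\varepsilon_n)^{d/a}(\log(1/\varepsilon_n))^{2d/a}, & \text{if } \varepsilon_n2^{J_\alpha a}>J_\alpha^2.
\end{cases}
\end{align*}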

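The display gives the concentration function at the projection $w_0^J$. By
Theorem 2.1, there exist measurable sets $B_n$ satisfying the three assertions of
Theorem 4.5, but with $w_0$ replaced by $w_0^J$ and the 4 in the last condition
replaced by 2. Since

\begin{align*}
\|w_0-w_0^J\|_\infty&\le\sum_{j=J+1}^\infty2^{jd/2}\max_k|w_{0;j,k}|\\
&\le\sum_{j=J+1}^\infty2^{-j\beta}\|w_0\|_{\beta|\infty,\infty}\lesssim2^{-J\beta},
\end{align*}

we have the three assertions of Theorem 4.5 as given as soon as

\[ \phi_{w_0^J}(\varepsilon_n)\le n\varepsilon_n^2\quad\text{and}\quad2^{-J\beta}\lesssim\varepsilon_n. \]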

The proof is completed by verifying that $\varepsilon_n$ as given in the theorem satisfies
these inequalities in the various cases, for suitable $J$ (set $J=J_\alpha$ if $a\le\alpha$
and $J=J_a$ otherwise). We omit the (tedious) derivation of this. □